THE HUMAN GENOME

A 2.91-billion base pair (bp) consensus sequence of the euchromatic portion of the human genome was generated by the whole-genome shotgun sequencing method. The 14.8-billion bp DNA sequence was generated over 9 months from 27,271,853 high-quality sequence reads (5.11-fold coverage of the genome) from both ends of plasmid clones made from the DNA of five individuals. Two assembly strategies (a whole-genome assembly and a regional chromosome assembly) were used, each combining sequence data from Celera and the publicly funded genome effort. The public data were shredded into 550-bp segments to create a 2.9-fold coverage of those genome regions that had been sequenced, without including biases inherent in the cloning and assembly procedure used by the publicly funded group. This brought the effective coverage in the assemblies to eightfold, reducing the number and size of gaps in the final assembly over what would be obtained with 5.11-fold coverage. The two assembly strategies yielded very similar results that largely agree with independent mapping data. The assemblies effectively cover the euchromatic regions of the human chromosomes. More than 90% of the genome is in scaffold assemblies of 100,000 bp or more, and 25% of the genome is in scaffolds of 10 million bp or larger. Analysis of the genome sequence revealed 26,588 protein-encoding transcripts for which there was strong corroborating evidence and an additional ~12,000 computationally derived genes with mouse matches or other weak supporting evidence. Although gene-dense clusters are obvious, almost half the genes are dispersed in low G+C sequence separated by large tracts of apparently noncoding sequence. Only 1.1% of the genome is spanned by exons, whereas 24% is in introns, with 75% of the genome being intergenic DNA. Duplications of segmental blocks, ranging in size up to chromosomal lengths, are abundant throughout the genome and reveal a complex evolutionary history. Comparative genomic analysis indicates vertebrate expansions of genes associated with neuronal function, with tissue-specific developmental regulation, and with the hemostasis and immune systems. DNA sequence comparisons between the consensus sequence and publicly funded genome data provided locations of 2.1 million single-nucleotide polymorphisms (SNPs). A random pair of human haploid genomes differed at a rate of 1 bp per 1250 on average, but there was marked heterogeneity in the level of polymorphism across the genome. Less than 1% of all SNPs resulted in variation in proteins, but the task of determining which SNPs have a functional consequence remains an open challenge.



Decoding of the DNA that constitutes the human genome has been widely anticipated for the contribution it will make toward understanding human evolution, the causation of disease, and the interplay between the environment and heredity in defining the human condition. A project with the goal of determining the complete nucleotide sequence of the human genome was first formally proposed in 1985 (1). In subsequent years, the idea met with mixed reactions in the scientific community (2). However, in 1990, the Human Genome Project (HGP) was officially initiated in the United States under the direction of the National Institutes of Health and the U.S. Department of Energy with a 15-year, $3 billion plan for completing the genome sequence. In 1998 we announced our intention to build a unique genome-sequencing facility, to determine the sequence of the human genome over a 3-year period. Here we report the penultimate milestone along the path toward that goal, a nearly complete sequence of the euchromatic portion of the human genome. The sequencing was performed by a whole-genome random shotgun method with subsequent assembly of the sequenced segments.

The modern history of DNA sequencing began in 1977, when Sanger reported his method for determining the order of nucleotides of DNA using chain-terminating nucleotide analogs (3). In the same year, the first human gene was isolated and sequenced (4). In 1986, Hood and co-workers (5) described an improvement in the Sanger sequencing method that included attaching fluorescent dyes to the nucleotides, which permitted them to be sequentially read by a computer. The first automated DNA sequencer, developed by Applied Biosystems in California in 1987, was shown to be successful when the sequences of two genes were obtained with this new technology (6). From early sequencing of human genomic regions (7), it became clear that cDNA sequences (which are reverse-transcribed from RNA) would be essential to annotate and validate gene predictions in the human genome. These studies were the basis in part for the development of the expressed sequence tag (EST) method of gene identification (8), which is a random selection, very high throughput sequencing approach to characterize cDNA libraries. The EST method led to the rapid discovery and mapping of human genes (9). The increasing numbers of human EST sequences necessitated the development of new computer algorithms to analyze large amounts of sequence data, and in 1993 at The Institute for Genomic Research (TIGR), an algorithm was developed that permitted assembly and analysis of hundreds of thousands of ESTs. This algorithm permitted characterization and annotation of human genes on the basis of 30,000 EST assemblies (10).

The complete 49-kbp bacteriophage lambda genome sequence was determined by a shotgun restriction digest method in 1982 (11). When considering methods for sequencing the smallpox virus genome in 1991 (12), a whole-genome shotgun sequencing method was discussed and subsequently rejected owing to the lack of appropriate software tools for genome assembly. However, in 1994, when a microbial genome-sequencing project was contemplated at TIGR, a whole-genome shotgun sequencing approach was considered possible with the TIGR EST assembly algorithm. In 1995, the 1.8-Mbp Haemophilus influenzae genome was completed by a whole-genome shotgun sequencing method (13). The experience with several subsequent genome-sequencing efforts established the broad applicability of this approach (14, 15).

A key feature of the sequencing approach used for these megabase-size and larger genomes was the use of paired-end sequences (also called mate pairs), derived from subclone libraries with distinct insert sizes and cloning characteristics. Paired-end sequences are sequences 500 to 600 bp in length from both ends of double-stranded DNA clones of prescribed lengths. The success of using end sequences from long segments (18 to 20 kbp) of DNA cloned into bacteriophage lambda in assembly of the microbial genomes led to the suggestion (16) of an approach to simultaneously map and sequence the human genome by means of end sequences from 150-kbp bacterial artificial chromosomes (BACs) (17, 18). The end sequences spanned by known distances provide long-range continuity across the genome. A modification of the BAC end-sequencing (BES) method was applied successfully to complete chromosome 2 from the Arabidopsis thaliana genome (19).
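As an illustration of how end sequences at a known separation give long-range continuity, the following minimal Python sketch estimates the gap between two contigs from clones whose two end reads fall in different contigs. The `MateLink` record, its field names, and the inverse-variance weighting are assumptions made for this example; they are not taken from the assembly software described in this paper.

```python
from dataclasses import dataclass


@dataclass
class MateLink:
    """One clone whose two end reads land in different contigs (hypothetical record)."""
    insert_mean: float   # mean insert size of the clone's library, in bp
    insert_sd: float     # standard deviation of that insert size, in bp
    left_span: int       # bp from the outer end of the read in contig A to the end of contig A
    right_span: int      # bp from the start of contig B to the outer end of the read in contig B


def estimate_gap(links):
    """Combine mate links into one gap estimate (mean, sd) by inverse-variance weighting.

    Each link implies gap = insert size - sequence already accounted for
    inside the two contigs. The weighting scheme is an assumption.
    """
    if not links:
        raise ValueError("need at least one mate link")
    weights = [1.0 / (link.insert_sd ** 2) for link in links]
    gaps = [link.insert_mean - link.left_span - link.right_span for link in links]
    total = sum(weights)
    mean = sum(w * g for w, g in zip(weights, gaps)) / total
    sd = (1.0 / total) ** 0.5
    return mean, sd


if __name__ == "__main__":
    # Two hypothetical 10-kbp clones bridging the same pair of contigs.
    links = [MateLink(10_000, 800, 3_000, 2_500), MateLink(10_000, 800, 4_000, 1_700)]
    print(estimate_gap(links))
```

A negative estimate would suggest that the two contigs overlap rather than being separated by a gap; more links between the same contigs tighten the estimate.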
In 1997, Weber and Myers (20) proposed whole-genome shotgun sequencing of the human genome. Their proposal was not well received (21). However, by early 1998, as less than 5% of the genome had been sequenced, it was clear that the rate of progress in human genome sequencing worldwide was very slow (22), and prospects for finishing the genome by the 2005 goal were uncertain.

In early 1998, PE Biosystems (now Applied Biosystems) developed an automated, high-throughput capillary DNA sequencer, subsequently called the ABI PRISM 3700 DNA Analyzer. Discussions between PE Biosystems and TIGR scientists resulted in a plan to undertake the sequencing of the human genome with the 3700 DNA Analyzer and the whole-genome shotgun sequencing techniques developed at TIGR (23). Many of the principles of operation of a genome-sequencing facility were established in the TIGR facility (24). However, the facility envisioned for Celera would have a capacity roughly 50 times that of TIGR, and thus new developments were required for sample preparation and tracking and for whole-genome assembly. Some argued that the required 150-fold scale-up from the H. influenzae genome to the human genome with its complex repeat sequences was not feasible (25). The Drosophila melanogaster genome was thus chosen as a test case for whole-genome assembly on a large and complex eukaryotic genome. In collaboration with Gerald Rubin and the Berkeley Drosophila Genome Project, the nucleotide sequence of the 120-Mbp euchromatic portion of the Drosophila genome was determined over a 1-year period (26-28). The Drosophila genome-sequencing effort resulted in two key findings: (i) that the assembly algorithms could generate chromosome assemblies with highly accurate order and orientation with substantially less than 10-fold coverage, and (ii) that undertaking multiple interim assemblies in place of one comprehensive final assembly was not of value.

These findings, together with the dramatic changes in the public genome effort subsequent to the formation of Celera (29), led to a modified whole-genome shotgun sequencing approach to the human genome. We initially proposed to do 10-fold sequence coverage of the genome over a 3-year period and to make interim assembled sequence data available quarterly. The modifications included a plan to perform random shotgun sequencing to ~5-fold coverage and to use the unordered and unoriented BAC sequence fragments and subassemblies published in GenBank by the publicly funded genome effort (30) to accelerate the project. We also abandoned the quarterly announcements in the absence of interim assemblies to report.

Although this strategy provided a reasonable result very early that was consistent with a whole-genome shotgun assembly with eightfold coverage, the human genome sequence is not as finished as the Drosophila genome was with an effective 13-fold coverage. However, it became clear that even with this reduced coverage strategy, Celera could generate an accurately ordered and oriented scaffold sequence of the human genome in less than 1 year. Human genome sequencing was initiated 8 September 1999 and completed 17 June 2000. The first assembly was completed 25 June 2000, and the assembly reported here was completed 1 October 2000. Here we describe the whole-genome random shotgun sequencing effort applied to the human genome. We developed two different assembly approaches for assembling the ~3 billion bp that make up the 23 pairs of chromosomes of the Homo sapiens genome. Any GenBank-derived data were shredded to remove potential bias to the final sequence from chimeric clones, foreign DNA contamination, or misassembled contigs. Insofar as a correctly and accurately assembled genome sequence with faithful order and orientation of contigs is essential for an accurate analysis of the human genetic code, we have devoted a considerable portion of this manuscript to the documentation of the quality of our reconstruction of the genome. We also describe our preliminary analysis of the human genetic code on the basis of computational methods. Figure 1 (see fold-out chart associated with this issue; files for each chromosome can be found in Web fig. 1 on Science Online at www.sciencemag.org/cgi/content/full/291/5507/1304/DC1) provides a graphical overview of the genome and the features encoded in it. The detailed manual curation and interpretation of the genome are just beginning.

To aid the reader in locating specific analytical sections, we have divided the paper into seven broad sections. A summary of the major results appears at the beginning of each section.

1 Sources of DNA and Sequencing Methods
2 Genome Assembly Strategy and Characterization
3 Gene Prediction and Annotation
4 Genome Structure
5 Genome Evolution
6 A Genome-Wide Examination of Sequence Variations
7 An Overview of the Predicted Protein-Coding Genes in the Human Genome
8 Conclusions

1 Sources of DNA and Sequencing Methods

Summary. This section discusses the rationale and ethical rules governing donor selection to ensure ethnic and gender diversity along with the methodologies for DNA extraction and library construction. The plasmid library construction is the first critical step in shotgun sequencing. If the DNA libraries are not uniform in size, nonchimeric, and do not randomly represent the genome, then the subsequent steps cannot accurately reconstruct the genome sequence. We used automated high-throughput DNA sequencing and the computational infrastructure to enable efficient tracking of enormous amounts of sequence information (27.3 million sequence reads; 14.9 billion bp of sequence). Sequencing and tracking from both ends of plasmid clones from 2-, 10-, and 50-kbp libraries were essential to the computational reconstruction of the genome. Our evidence indicates that the accurate pairing rate of end sequences was greater than 98%.

Various policies of the United States and the World Medical Association, specifically the Declaration of Helsinki, offer recommendations for conducting experiments with human subjects. We convened an Institutional Review Board (IRB) (31) that helped us establish the protocol for obtaining and using human DNA and the informed consent process used to enroll research volunteers for the DNA-sequencing studies reported here. We
adopted several steps and procedures to protect the privacy rights and confidentiality of the research subjects (donors). These included a two-stage consent process, a secure random alphanumeric coding system for specimens and records, circumscribed contact with the subjects by researchers, and options for off-site contact of donors. In addition, Celera applied for and received a Certificate of Confidentiality from the Department of Health and Human Services. This Certificate authorized Celera to protect the privacy of the individuals who volunteered to be donors as provided in Section 301(d) of the Public Health Service Act 42 U.S.C. 241(d).

Celera and the IRB believed that the initial version of a completed human genome should be a composite derived from multiple donors of diverse ethnic backgrounds. Prospective donors were asked, on a voluntary basis, to self-designate an ethnogeographic category (e.g., African-American, Chinese, Hispanic, Caucasian, etc.). We enrolled 21 donors (32).

Three basic items of information from each donor were recorded and linked by confidential code to the donated sample: age, sex, and self-designated ethnogeographic group. From females, ~130 ml of whole, heparinized blood was collected. From males, ~130 ml of whole, heparinized blood was collected, as well as five specimens of semen, collected over a 6-week period. Permanent lymphoblastoid cell lines were created by Epstein-Barr virus immortalization. DNA from five subjects was selected for genomic DNA sequencing: two males and three females (one African-American, one Asian-Chinese, one Hispanic-Mexican, and two Caucasians) (see Web fig. 2 on Science Online at www.sciencemag.org/cgi/content/full/291/5507/1304/DC1). The decision of whose DNA to sequence was based on a complex mix of factors, including the goal of achieving diversity as well as technical issues such as the quality of the DNA libraries and availability of immortalized cell lines.

1.1 Library construction and sequencing

Central to the whole-genome shotgun sequencing process is preparation of high-quality plasmid libraries in a variety of insert sizes so that pairs of sequence reads (mates) are obtained, one read from both ends of each plasmid insert. High-quality libraries have an equal representation of all parts of the genome, a small number of clones without inserts, and no contamination from such sources as the mitochondrial genome and Escherichia coli genomic DNA. DNA from each donor was used to construct plasmid libraries in one or more of three size classes: 2 kbp, 10 kbp, and 50 kbp (Table 1) (33).

In designing the DNA-sequencing process, we focused on developing a simple system that could be implemented in a robust and reproducible manner and monitored effectively (Fig. 2) (34).

Current sequencing protocols are based on the dideoxy sequencing method (35), which typically yields only 500 to 750 bp of sequence per reaction. This limitation on read length has made monumental gains in throughput a prerequisite for the analysis of large eukaryotic genomes. We accomplished this at the Celera facility, which occupies about 30,000 square feet of laboratory space and produces sequence data continuously at a rate of 175,000 total reads per day. The DNA-sequencing facility is [...]

The process for DNA sequencing was modular by design and automated. Intermodule sample backlogs allowed four principal modules to operate independently: (i) library transformation, plating, and colony picking; (ii) DNA template preparation; (iii) dideoxy sequencing reaction set-up and purification; and (iv) sequence determination with the ABI PRISM 3700 DNA Analyzer. Because the inputs and outputs of each module have been carefully matched and sample backlogs are continuously managed, sequencing has proceeded without a single day's interruption since the initiation of the Drosophila project in May 1999. The ABI 3700 is a fully automated capillary array sequencer and as such can be operated with a minimal amount of hands-on time, currently estimated at about 15 min per day. The capillary system also facilitates correct associations of sequencing traces with samples through the elimination of manual sample loading and lane-tracking errors associated with slab gels. About 65 production staff were hired and trained, and were rotated on a regular basis through the four production modules. A central laboratory information management system (LIMS) tracked all sample plates by unique bar code identifiers. The facility was supported by a quality control team that performed raw material and in-process testing and a quality assurance group with responsibilities including document control, validation, and auditing of the facility. Critical to the success of the scale-up was the validation [...] of any process change.

1.2 Trace processing

An automated trace-processing pipeline has been developed to process each sequence file. After quality and vector trimming, the average trimmed sequence length was 543 bp, and the sequencing accuracy was exponentially distributed with a mean of 99.5% and [...] than 98% accurate (26). Each trimmed sequence was screened for matches to contaminants including sequences of vector alone, E. coli genomic DNA, and human mitochondrial DNA. The entire read for any sequence with a significant match to a contaminant was discarded. A total of 713 reads matched E. coli genomic DNA and 2114 reads matched the human mitochondrial genome.

1.3 Quality assessment and control

The importance of the base-pair level accuracy of the sequence data increases as the size and repetitive nature of the genome to be sequenced increases. Each sequence read must be placed uniquely in the genome, and even a modest error rate can reduce the effectiveness of assembly. In addition, maintaining the validity of mate-pair information is absolutely critical for the algorithms described below. Procedural controls were established for maintaining the validity of sequence mate pairs as sequencing reactions proceeded through [...] entire human genome in a single facility, we were able to ensure uniform quality standards and the cost advantages associated with automation, an economy of scale, [...]

Table 1. Celera-generated data input into assembly. Read counts are given per insert library.

                            Individual         2 kbp      10 kbp      50 kbp       Total  Total base pairs
No. of sequencing reads     A                      0           0   2,767,357   2,767,357    1,502,674,851
                            B             11,736,757   7,467,755      66,930  19,271,442   10,464,393,006
                            C                853,819     881,290           0   1,735,109      942,164,187
                            D                952,523   1,046,815           0   1,999,338    1,085,640,534
                            F                      0   1,498,607           0   1,498,607      813,743,601
                            Total         13,543,099  10,894,467   2,834,287  27,271,853   14,808,616,179

Fold sequence coverage      A                      0           0        0.52        0.52
(2.9-Gb genome)             B                   2.20        1.40        0.01        3.61
                            C                   0.16        0.17           0        0.32
                            D                   0.18        0.20           0        0.37
                            F                      0        0.28           0        0.28
                            Total               2.54        2.04        0.53        5.11

Fold clone coverage         A                      0           0       18.39       18.39
                            B                   2.96       11.26        0.44       14.67
                            C                   0.22        1.33           0        1.54
                            D                   0.24        1.58           0        1.82
                            F                      0        2.26           0        2.26
                            Total               3.42       16.43       18.84       38.68

Insert size* (mean)         Average         1,951 bp   10,800 bp   50,715 bp
Insert size* (SD)           Average            6.10%       8.10%      14.90%
% Mates†                    Average            74.50       80.80       75.60

*Insert size and SD are calculated from assembly of mates on contigs. †% Mates is based on laboratory tracking of sequencing runs.

[...] clustering all of the fragments to a region or chromosome on the basis of [...] data [...]



[Flow diagram not reproduced. Recoverable stage labels: Human Samples (Medical Affairs); Tissue Samples (DNA Resources); DNA/RNA (External) (DNA Resources); Libraries (DNA Resources); Fluorescently Labeled DNA (Pre-Sequencing Lab); Trace Files, NT and UNIX (Sequencing Lab); External Fragments (Content Systems - EDA); Trimmed Fragments (Content Systems); Proto I/O Files (Content Systems); Assemblies. Recoverable QC steps: sample screening; size and concentration; insert size; trace file validation; byte count; "gatekeeper" syntax, duplicates, and quality values.]

Fig. 2. Flow diagram for sequencing pipeline. Samples are received, selected, and processed in compliance with standard operating procedures, with a focus on quality within and across departments. Each process has defined inputs and outputs with the capability to exchange samples and data with both internal and external entities according to defined quality guidelines. Manufacturing pipeline processes, products, quality control measures, and responsible parties are indicated and are described further in the text.
and provide a comparison to the public genome sequence, which was reconstructed largely by an independent BAC-by-BAC approach. Our assemblies effectively covered the euchromatic regions of the human chromosomes. More than 90% of the genome was in scaffold assemblies of 100,000 bp or greater, and 25% of the genome was in scaffolds of 10 million bp or larger.

Shotgun sequence assembly is an example of an inverse problem: given a set of reads randomly sampled from a target sequence, reconstruct the order and the position of those reads in the target. Assembly algorithms developed for Drosophila have now been extended to assemble the ~25-fold larger human genome. Celera assemblies consist of a set of contigs that are ordered and oriented into scaffolds that are then mapped to chromosomal locations by using known markers. The contigs consist of a collection of overlapping sequence reads that provide a consensus reconstruction for a contiguous interval of the genome. Mate pairs are a central component of the assembly strategy. They are used to produce scaffolds in which the size of gaps between consecutive contigs is known with reasonable precision. This is accomplished by observing that a pair of reads, one of which is in one contig, and the other of which is in another, implies an orientation and distance between the two contigs (Fig. 3). Finally, our assemblies did not incorporate all reads into the final set of reported scaffolds. This set of unincorporated reads is termed "chaff," and typically consisted of reads from within highly repetitive regions, data from other organisms introduced through various routes as found in many genome projects, and data of poor quality or with untrimmed vector.

[Schematic not reproduced. Recoverable labels: Mapped Scaffolds; STS; Genome; Scaffold; Read pair (mates); Gap (mean and std. dev. known); Contig; Consensus; Reads (of several haplotypes); SNPs; BAC Fragments.]

Fig. 3. Anatomy of whole-genome assembly. Overlapping shredded bactig fragments (red lines) and internally derived reads from five different individuals (black lines) are combined to produce a contig and a consensus sequence (green line). Contigs are connected into scaffolds (red) by using mate pair information. Scaffolds are then mapped to the genome (gray line) with STS (blue star) physical map information.

2.1 Assembly data sets

We used two independent sets of data for our assemblies. The first was a random shotgun data set of 27.27 million reads of average length 543 bp produced at Celera. This consisted largely of mate-pair reads from 16 libraries constructed from DNA samples taken from five different donors. Libraries with insert sizes of 2, 10, and 50 kbp were used. By looking at how mate pairs from a library were positioned in known sequenced stretches of the genome, we were able to characterize the range of insert sizes in each library and determine a mean and standard deviation. Table 1 details the number of reads, sequencing coverage, and clone coverage achieved by the data set. The clone coverage is the coverage of the genome in cloned DNA, considering the entire insert of each clone that has sequence from both ends. The clone coverage provides a measure of the amount of physical DNA coverage of the genome. Assuming a genome size of 2.9 Gbp, the Celera trimmed sequences gave a 5.1X coverage of the genome, and clone coverage was 3.42X, 16.40X, and 18.84X for the 2-, 10-, and 50-kbp libraries, respectively, for a total of 38.7X clone coverage.

The second data set was from the publicly funded Human Genome Project (PFP) and is primarily derived from BAC clones (30). The BAC data input to the assemblies came from a download of GenBank on 1 September 2000 (Table 2) totaling 4443.3 Mbp of sequence. The data for each BAC is deposited at one of four levels of completion. Phase 0 data are a set of generally unassembled sequencing reads from a very light shotgun of the BAC, typically less than 1X. Phase 1 data are unordered assemblies of contigs, which we call BAC contigs or bactigs. Phase 2 data are ordered assemblies of bactigs. Phase 3 data are complete BAC sequences. In the past 2 years the PFP has focused on a product of lower quality and completeness, but on a faster time-course, by concentrating on the production of Phase 1 data from a 3X to 4X light-shotgun of each BAC clone.

We screened the bactig sequences for contaminants by using the BLAST algorithm against three data sets: (i) vector sequences in Univector core (38), filtered for a 25-bp match at 98% sequence identity at the ends and a 30-bp match internal to the sequence; (ii) the nonhuman portion of the High Throughput Genomic (HTG) sequences division of GenBank [...]; and (iii) the nonredundant nucleotide sequences from GenBank without primate and human virus entries, filtered at 200 bp at 98%. Whenever 25 bp or more of vector was found within 50 bp of the end of a contig, the tip up to the matching vector was excised. Under these criteria we removed 2.6 Mbp of possible contaminant and vector from the Phase 3 data, 61.0 Mbp from the Phase 1 and 2 data, and 16.1 Mbp from the Phase 0 data (Table 2). This left us with a total of 4363.7 Mbp of PFP sequence data: 20% finished, 75% rough-draft (Phase 1 and 2), and 5% single sequencing reads (Phase 0). An additional 104,018 BAC end-sequence mate pairs were also downloaded and included in the data sets for both assembly processes (18).
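The coverage figures quoted for the Celera data set can be checked with simple arithmetic. The Python sketch below computes fold sequence coverage as total trimmed bases divided by genome size, which matches the definition implied by Table 1. The clone-coverage formula (mated reads / 2 x mean insert size / genome size) is an assumption made for this example; it reproduces the Table 1 values only approximately, so the exact calculation used for the table may differ. Function names are hypothetical.

```python
GENOME_SIZE_BP = 2.9e9  # genome size assumed in the text


def sequence_coverage(total_bases, genome_size=GENOME_SIZE_BP):
    """Fold sequence coverage: sequenced bases per base of genome."""
    return total_bases / genome_size


def clone_coverage(n_reads, mate_fraction, mean_insert_bp, genome_size=GENOME_SIZE_BP):
    """Approximate fold clone coverage (assumed formula, see lead-in).

    Reads with a mate come in pairs, so n_reads * mate_fraction / 2 clones
    have sequence from both ends; each spans mean_insert_bp of the genome.
    """
    paired_clones = n_reads * mate_fraction / 2
    return paired_clones * mean_insert_bp / genome_size


if __name__ == "__main__":
    # Totals from Table 1.
    print(round(sequence_coverage(14_808_616_179), 2))  # 5.11
    # Per-library inputs from Table 1: reads, % mates, mean insert size.
    for name, reads, mates, insert in [
        ("2 kbp", 13_543_099, 0.745, 1_951),
        ("10 kbp", 10_894_467, 0.808, 10_800),
        ("50 kbp", 2_834_287, 0.756, 50_715),
    ]:
        print(name, round(clone_coverage(reads, mates, insert), 2))
```

The much higher clone coverage of the 10- and 50-kbp libraries, despite fewer reads, is what allows mate pairs to bridge gaps and repeats that sequence coverage alone leaves unresolved.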



2.2 Assembly strategies

Two different approaches to assembly were pursued. The first was a whole-genome assembly process that used Celera data and the PFP data in the form of additional synthetic shotgun data, and the second was a compartmentalized assembly process that first partitioned the Celera and PFP data into sets localized to large chromosomal segments and then performed ab initio shotgun assembly on each set. Figure 4 gives a schematic of the overall process flow.
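To make the idea of "synthetic shotgun data" concrete, the following minimal Python sketch shreds one assembled sequence into fixed-length overlapping reads. The function name `shred`, the half-read step used to obtain roughly twofold coverage, and the handling of the final read are assumptions for illustration; only the 550-bp read length is taken from this paper, and the actual shredding procedure may differ in detail.

```python
def shred(sequence, read_len=550, fold=2):
    """Cut `sequence` into overlapping reads of `read_len` bases.

    Consecutive reads start read_len // fold bases apart, so interior bases
    are covered about `fold` times. Sequences no longer than one read are
    returned whole. This tiling scheme is an assumption, not the paper's.
    """
    if read_len <= 0 or fold <= 0:
        raise ValueError("read_len and fold must be positive")
    if len(sequence) <= read_len:
        return [sequence]
    step = max(1, read_len // fold)
    reads = []
    start = 0
    while start + read_len < len(sequence):
        reads.append(sequence[start:start + read_len])
        start += step
    # Anchor one last read at the end so no trailing bases are lost.
    reads.append(sequence[-read_len:])
    return reads


if __name__ == "__main__":
    # A made-up 2,200-base sequence standing in for a bactig.
    bactig = "".join("ACGT"[(i * 7 + i // 3) % 4] for i in range(2200))
    faux_reads = shred(bactig)
    print(len(faux_reads), {len(r) for r in faux_reads})
```

Because the faux reads carry no record of how the source sequence was assembled, any misjoin inside it can only influence the new assembly through reads that actually span the misjoin.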

For the whole-genome assembly, the PFP 
data was first disassembled or "shredded" into a 
synthetic shotgun data set of 550-bp reads that 
form a perfect 2X covering o£the bactigs. This 
resulted in 1 6.05 million "faux" reads that were 
sufficient to cover the genome 2.96X because 
of redundancy in the BAC data set, without 
incorporating the biases inherent in the PFP 
assembly process. The combined data set of 
43.32 million reads (8X), and all associated 
mate-pair information, were then subjected to 
our whole-genome assembly algorithm to pro- 
duce a reconstruction of the genome. Neither 
the location of a BAC in the genome nor its 
assembly of bactigs was used in this process. 
Bactigs were shredded into reads because we 
found strong evidence that 2.13% of them were 
misassembled (40). Furthermore, BAC location 
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■;a" iHs.rfs ^i.f^? :V; c>' 







^« I' 




n 



1 



Baylor College of 
.Medicine; USA 



' v. 



'0 
0 
0 
0 



' ^ £';were.nqt cprre&y MW^P 5 ^^ ^^^^i ^S^^^l^^^^ra^ ; ; ^took ; ^e}ex^^ri^pf deriving additiprwl 

V ^rn^.^b^^ .... ; . .^ence average, but riot mate pafc, assembled 

WKM ; u ,-.,.*"Vi;>"'' ; ■ ;.' > ;---V »'.t,'.V.';VV>'iS !/',.-/-iL.di '; : -irr^vj^kfe'. ; .y.,''..Kii;- r; '■ : . bactigsj-or genome locality, from some cxtef. 

~|n i VV';Tabte 2. CenBank data input j nto aaembly ;- ^V,,,,^ V^^^U^a sx : .ii--V.'.^v : .^iiy grated data.' - ; W ^ ; 

• X;'.'. • •'■ '• — ■ .,■■>,■ ^ .A-pr^ with - 

" "-• ce/and Aeri shotgm assembly was np- 
; each' ^ partitioned Vsiibset ; "wherein the i 

• ioSi cbnSmW masked V^ias^' / W17.055 - 98.028 ^ to independent ab miteassernbly of ;• 

v L) -~ .'U -S"- v". ' 'i^o - i^ .H?^r?j-»7^ the data in ..tl^.. 

Average contig length (bp) ^V! v; J ■ U&7» ■ ■ v ; 7,853 ; 134.5lb ; ; — ) me ove M - computational 'effort was re- ^ 

; - . : ". " ' ' ■ . .' : ^ ; ' ! ;.' '. . v .• ■ /, . ■ ., . r iq :\C: ^r.323Z --i r^;^J:1.3O0 •> v-ife'eed arid the'effecf of bterchromosbmal dupli- 

^-Wa^g&nUnivers^ 

r-"&: n USA > ■ Sase palrs^ ' ■ V->--.v ; '^ 1.195.732^ ;561.171.788...164.214.395 :- bnstruction of the genome that was relatively 

Total vector masked (bp) .* L "c ^ : '.^ ^: ^gg-^ v fodependent of me whole-genome assembly re- • 

Total contaminant masked :.. . ; ... v 22.469^ ^ ' , ; . • ^ so that the two assemblies could be com- 

9,079 • .126.319 vpared for consistency. The quality of the parti- 
1626 v < : 3 63 iv tioning -into icomponents .was xrucial so thai ; . 
:■ •• 44861 363 .different genome regions were not mixed to- 

265.547!o66 49,017.104 , . ge ther: We constructed components from (0 ihc 
. . '218.769 . , . 4,960 .. .v.j^ gst f SC aff 0 ids ! .-of the .^sequence froni. caeli 
.. 1.784.700 ;. , 485.137 > : , BAC .^ 0 i ([[) assembled scaffolds of data unique 

"•' ■ Q -.-J „Vo33 • to Celera's data set The BAC assemblies vverc 
5.919 ; : 135.033 .. . ^^ fey a . combming assembler that used tin: 

: 2.043 . ■ ■ ■■■ . 754 ; :bdcti ^ ^ me 5 x,Celefa data mapped to those - 
^6265 '-^vS^ 

^ 4;64t372 ^ 1 18387 ^,and cornplete me^ffbld for; a given jequcp" 
j • stretch/the more'accurately one can tile these 

8.422 80.867 , scaffo i ds into contiguous components on Hie 

basis of sequence overlap and mate-pair information. We further
visually inspected and curated the scaffold tiling of the components
to further increase its accuracy. For the final CSA assembly, all but
the partitioning was ignored, and an independent ab initio
reconstruction of the sequence in each component was obtained by
applying our whole-genome assembly algorithm to the partitioned,
relevant Celera data and the shredded, faux reads of the partitioned,
relevant bactig data.

2.3 Whole-genome assembly

The algorithms used for whole-genome assembly (WGA) of the human
genome were enhancements to those used to produce the sequence of the
Drosophila genome reported in detail in (28).

The WGA assembler consists of a pipeline composed of five principal
stages: Screener, Overlapper, Unitigger, Scaffolder, and Repeat
Resolver, respectively. The Screener finds and marks all
microsatellite repeats with less than a 6-bp element, and screens out
all known interspersed repeat elements, including Alu, Line, and
ribosomal DNA. Marked regions get searched for overlaps, whereas
screened regions do not get searched, but can be part of an overlap
that involves unscreened matching segments.
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Number of accession records 
Number "of contigs . 
Total base pairs' 
....... Total vector masked (bp) 

•? : *■ -t . ^ . j - '-.'i - 1 : J- Total contaminant masked ■.: 

" (bp) ■ :"■ 

Average contig length (bp) . 

.Production Sequencing , Number of accession records 
/ Facility, DOE Joint Number.pf contigs : s . ; ■ 
Genome Instituted /Total base pairs : , v^/^-- 
USA • V.< ; Total vector masked (bp) ■ 
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, (bp) \ ' - 
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and Chemical Number of contigs 
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5.564,879 
57,448 
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shredded Into faux reads resulting In 236X coverage of the genome. _ ^ .. . . 
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the Human genome 



The Overlapper compares every read against every other read in search
of complete end-to-end overlaps of at least 40 bp and with no more
than 6% differences in the match.
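The overlap criterion just stated can be illustrated with a minimal sketch. This is not the Celera Overlapper: it is a brute-force check that, as a simplifying assumption, counts substitutions only (no insertions or deletions), and the function name and parameters are hypothetical.

```python
def suffix_prefix_overlap(a, b, min_len=40, max_diff=0.06):
    """Length of the longest suffix of read `a` that matches a prefix of
    read `b` with at most `max_diff` fraction of mismatches, or 0 if no
    overlap of at least `min_len` bases qualifies.

    Assumption: differences are substitutions only; a production
    overlapper would also tolerate indels and use seeding for speed.
    """
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        tail, head = a[-length:], b[:length]
        mismatches = sum(x != y for x, y in zip(tail, head))
        if mismatches <= max_diff * length:
            return length
    return 0
```

With the defaults, two reads are reported as overlapping only if they share an end-to-end match of 40 bp or more in which no more than 6% of the aligned positions differ.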



:>.*../ 

J. 



inn £ Tnn k P • DtS 34,(1 0ther and *us more likely to make a 

?}?&#3f. Wye segments. ;. ■ .mistake. For the human assembly, we contin- 

• The result of runnmg .the' Unitigger .was-^ ued to use the first "Rocks" substege where 

Because .all data are scrupulously -vecto>tTcc^i^S; ^^^^^m ^m^^ ^ definitive; 
trimmed, the. Overlapper can u^^^^^S-^St^ 9 ' J ST? o- ^^toV. score ,are placed in a scaffold 
; plete overlap n^tche^Co^u^s^ ^SwSSSorSS? ' • ^^J^i^^^^>^ that 
:^oveibp8.tobk lOugMy-lO MO CTU hours -^ ; SeHL^"^T^^^ : ^^^> t,y0 ^^ie^-fpairs with one of their 
with a suke W fb4rocSlg"S V£^SL^^ 

■ with 4 gigabytes of RAM'-Tnfc-tM^ uniti S m &e given gap. We estimate 

idaysnSpsS^^ 

^operat m y m parallel.- 

, Eyeryov^lapcompUtedaboveisstat^ 
tally^^ 

the sequence relds shoW 

gether.even more overlaps are actually from 4 Sna^ mate-pairing , 

two disttact copies of a^ow'ipf^S^^ 99% **• i 

element not screenedabove^^ 

in the process P ' P " ? ly e " ly ^ ;v - 

genome , ., the assembly. This operation proved much 

y Foi ths Drosophila assembly, we engaged ; ; - more reliable than the one it replaced for the 
in .a. three-stage repeat .resolution • strategy vsHDros'ophila assembly; 'in the assembly of a ; 
where, each .stage .was progressively .more ^simulated shotgun data set of human chrome- • 



in the process. 

We achieve this objective in the Unitigger. We first find all
assemblies of reads that appear to be uncontested with respect to all
other reads. We call the contigs formed from these subassemblies
unitigs (for uniquely assembled contigs). Formally, these unitigs are
the uncontested interval subgraphs of the graph of all overlaps.
Although empirically many of these assemblies are correct (and thus
involve only true overlaps), some are in fact collections of reads
from several copies of a repetitive element that have been
overcollapsed into a single subassembly. However, the overcollapsed
unitigs are easily identified because their average coverage depth is
too high to be consistent with the overall level of sequence coverage.
We developed a simple statistical discriminator that gives the
logarithm of the odds ratio that a unitig is composed of unique DNA or
of a repeat consisting of two or more copies. The discriminator, set
to a sufficiently stringent threshold, identifies a subset of the
unitigs that we are certain are correct. In addition, a second, less
stringent threshold identifies a subset of remaining unitigs very
likely to be correctly assembled, of which we select those that will
consistently scaffold (see below), and thus are again almost certain
to be correct. We call the union of these two sets U-unitigs.
Empirically, we found from a 6X simulated shotgun of human chromosome
22 that we get U-unitigs covering 98% of the stretches of unique DNA
that are >2 kbp long. We are further able to identify the boundary of
the start of a repetitive element at the ends of a U-unitig and
leverage this so that U-unitigs span more than 93% of all
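The text does not give the formula for the unitig discriminator. One plausible form, shown here purely as an assumption, models read start positions as a Poisson process and compares the likelihood of the observed read count under a one-copy hypothesis with that under a two-copy hypothesis. The function name and arguments are hypothetical.

```python
import math


def unique_log_odds(unitig_len, reads_in_unitig, total_reads, genome_len):
    """Natural-log odds that a unitig is unique DNA rather than a
    two-copy collapsed repeat, under an assumed Poisson model.

    If lam is the expected number of reads for a single-copy unitig,
    then log[Pois(k; lam) / Pois(k; 2*lam)] = lam - k * ln 2.
    """
    lam = unitig_len * total_reads / genome_len
    return lam - reads_in_unitig * math.log(2)
```

A strongly positive score indicates coverage consistent with unique sequence, and a negative score indicates coverage deep enough to suggest a collapsed repeat. The two thresholds mentioned in the text would be cutoffs on a score of this kind.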



[Figure 4 diagram. Labels: 5.11X Celera reads, 39X mate pairs; public
bactigs (from 33,421 BACs); 2.96X faux reads; WGA; bactigs and Celera
pairs (binned by BAC); Combining Assembler; BAC scaffolds;
WGA + Shredder; Components 1, 2, ..., N; WGA assembly; CSA assembly.]

Fig. 4. Architecture of Celera's two-pronged assembly strategy. Each
oval denotes a computation process performing the function indicated
by its label, with the labels on arcs between ovals describing the
nature of the objects produced and/or consumed by a process. This
figure summarizes the discussion in the text that defines the terms
and phrases used.
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, -isome 22, all stones .were placed wnectlyi'^^ivjiaye. required a -computers exponential. 
. -^-^5 -^^Tlie" 'final "method of resolvmg"gaps ; is^t6;^^ all gaps were less than 500 

.". ni^fill .them with assembled BAC data that coyer ^ incremerital/vveiwere able to achieve the. same ^ bp long/>62% of all gaps were less than 1 kbp . 
. , :;:;^:the gap. -We call this external gap talking." computation with a maximum of instantoeous^^'i-long.'and no gap was MOO, kbp long. Similar- , 

X\ >Wedid not include the very aggressive *Teb-^>^ usage of 28 gigabytes of RAMjMoreover/the^ly^more than '65% of the sequence is in contigs 
\ : .. * ; : bles"; substage' described ; in our Drosdphila ^ ^^incremental nature of :the first three* stages al-;i^f >3 0 kbji, mbre'.fhan ■ 3 1 %, is . in contigs > 1 0Q 

^^work^-which made 'enough mistakes"so f as;to^ ' , 

. S P rc [duce repeat reconstmctions . for long 
v. ^%VCspersed .eleme 



i:^-*to introduce, a.step' that was certain to produce ^compute infrastnicture consists of 10 four-pro-' 
" : - ^.^^less ,than : 99.99%^ accuracy.-The cost/was . a ^cessor -SMPsTwith- : 4 gigabytes :of memory per A?*2.4 Cpmpartmentalized shotgun ^]' ; . . 
^f^somew^ M a;16-';; - , ass ^ rT ! Dl y - v ' . ' K 

" :i'v2: what ^ larger size.* ^-v^V . ^^^■. ! ^:-.;piocessor NUMA,machme'. : \w >^,In /addition ..to/ie^WGA^apprpach, ; r we ;puf-' 
- '; 't > At the final stage of the assembly pr6cess; ; ^ ; ;4q^me^ GS 1 60j) WUdfuj). ^ The ,,^£sued a 'localized Wsembly Approach that was 

. • fe^and ;-also.V atlseveral ^mpute it for^;i^ il^. intended •tp"'subdivide.!the^geri^ 

"yV^v ^ is'prp- ,V^ roughl^ t ■ ;y)-^mehts^each=6f^ be shotgun as- - 

: / i- : I duced. Our algorithm is driven by the princi- -^,'4 The ^assembly " : of Celera's ridata^ ^together " • /sembled , indiyiduaily.* '.We . expected , .that this 
V I • pie : of 1 maximum' 1 pareimoriy, ^vith 'quality-; ^. ^ with the shredded bactig da^'pr^uced a "set of - v^would help in-f esolution bf large interchro- 
/ : 'y value-weighted measures for evaluating each -^scaffolds totaling 2.848 IGbp in 'span and con- f - 1 * mosomal duplications and improve tiie statis- . 
. • ■ 'base. The net effect is a Bayesiah estimate of .^r-sisting of 2.586 Gbp of ^ sequence ;The chaff, or ]. :Vi ties, 1 for" calculating Uruhitigs. -.The cpmpart- 
the correct base to. report/at each pbsitipn.^ 
C Consensus generation uses Celera data when-, ^ ^{numbered 1 1 27 million (26%), which is con- t -v ■ tering Celera reads and ;bactigs into . large, 
ever it is present. In the event that no Celera ^sistent; with^ ^ bur ; experience for sDro^o^Ma. multiple megabase ;regions of the genome, , 
- data cover a given region, the. .BAC/ data. l^More;^ fan i running the WGA. assembler on the ! 
. :■ ■ ^sequence is used.'. ; - : Ji^is^:>^i scaffolds :> J 00. kbp longj and these; averaged • - . Celera 1 data and i shredded, - fau^^ads /ob^ 
' : ■ > A key element of achieving a WGA of the . 1 91% ^sequence " and " 9% gaps with a total of tained from the bactig data. . . T 
v .human genome was to parallelize the Overlap- ■•,2.297 Gbp of sequence. There were a total of : ; ;The first phase of the.CSA strategy "was to =. 
. per ' and the central consensus sequenceTCon-..; 93,85.7 : gaps among ■ the >1637, scaffolds > 100 /. . /separate Celera reads .into '^ose.&at matched ^ 
. Jv^stnicting^^ memory :y/as^ ^-J^ 

. ^ . V a "real issue— ^a sti^^tfon^fd " application Jof ^ ':: ; .y,the 'average cbntig; size was 24.06 kbp, arid the f 5 entry, and those^thaf <ii d not .match" any public^ 
.." ... ; - , the software we had built for DrosophUa would ;f average gap size was 2.43 kbp, where' the. dis-. . •"data. - Such matches ;mu^ > 



Table 3. Scaffold statistics for whole-genome and compartmentalized shotgun assemblies.

Scaffold size                    All            >30 kbp        >100 kbp       >500 kbp       >1000 kbp

Compartmentalized shotgun assembly
No. of bp in scaffolds           2,905,568,203  2,748,892,430  2,700,489,906  2,489,357,260  2,248,689,128
  (including intrascaffold gaps)
No. of bp in contigs             2,653,979,733  2,524,251,302  2,491,538,372  2,320,648,201  2,106,521,902
No. of scaffolds                 53,591         2,845          1,935          1,060          721
No. of contigs                   170,033        112,207        107,199        93,138         82,009
No. of gaps                      116,442        109,362        105,264        [...]          81,288
No. of gaps ≤1 kbp               72,091         69,175         67,289         [...]          53,354
Average scaffold size (bp)       54,217         966,219        1,395,602      2,348,450      3,118,848
Average contig size (bp)         15,609         22,496         [...]          24,916         25,686
Average intrascaffold gap size   2,161          [...]          1,985          1,832          1,749
Largest contig (bp)              1,988,321      1,988,321      1,988,321      1,988,321      1,988,321
% of total contigs               100            95             94             87             79

Whole-genome assembly
No. of bp in scaffolds           2,847,890,390  2,574,792,618  2,525,334,447  2,328,535,466  2,140,943,032
  (including intrascaffold gaps)
No. of bp in contigs             2,586,634,108  2,334,343,339  2,297,678,935  2,143,002,184  1,983,305,432
No. of scaffolds                 118,968        2,507          1,637          818            554
No. of contigs                   221,036        99,189         95,494         84,641         76,285
No. of gaps                      102,068        96,682         93,857         83,823         75,731
No. of gaps ≤1 kbp               62,356         60,343         59,156         54,079         49,592
Average scaffold size (bp)       23,938         1,027,041      1,542,660      2,846,620      3,864,518
Average contig size (bp)         11,702         23,534         24,061         25,319         25,999
Average intrascaffold gap size   2,560          2,487          2,426          2,213          2,082
Largest contig (bp)              1,224,073      1,224,073      1,224,073      1,224,073      1,224,073
% of total contigs               100            90             89             83             77
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properly place a Celera read, so all reads were 
first masked against a library of common repetitive elements, and only
matches of at least 40 bp to unmasked portions of the read constituted
a hit. Of Celera's 27.27 million reads, 20.76 million matched a bactig
and

« « f\ ^1 111* • 13 ' - l l • -i «. i 



r 

O x ric or contaminating sequence (from 
anotner part of the genome) would not be 
incorporated into the reassembly of the com- 
ponent because it did not belong there. In 
i effect, the previous steps, in. the JCS A process 
-served .only -to bring .together; Celera frag- 



-issembly took place, but not enough Celera 
/data were matched to truly assemble the 0.5 X ; 
to IX data set represented by the typical. 
Phase :0 BACs; - The combining assembler 
u was also .applied to the Phase *3 BACs ;fbr; 

- «mAtW n /co ' -ir — * - ^V-'.vy ~' f¥?' y^ 10 *!?^^ assem-^-^ vw . yuiy lu uiwg ,iugcuier,^eiera irae- 

. rZ\?lr? t . CheS, :I - ^ Ctl ^ S l lden ^ suggest that a combined wholes :tigubx^ whin we 

• fiedasbelongmg^ 

BAC because then; ^r^d^^g. * gun of BACs will not yield good assembly of : {^ce^h a* initio assembly of Ae^on^- 

unmasked se^ 

• were not found ^ - the : GenBank, data set. c ^ 

wgsemtyyresu^ 

i^lW>? 5' span and consisting of 326 Mbp /^genome was covered ; by-scaffolds \ spanning ; 
^:ofsequence. :Morethan20%ofmescaiT^^ 

■/ we , r ^; - >5 * b P lon & and &ese averaged 63% v sequence and 7.8% gaps with a total of 2.492 
x sequence: and 27% gaps with a total of 302 . . Gbp of sequence.,. There .were a total of 
- ^P; of , s ^^ ue B c ^. All scaffolds >5.kbp were : ^ 1 05,264 gaps among the ; 1 07, 1 99 contigs that i 
' forwarded along -with all scaffolds produced <>: 'belong to the 1 940- scaffolds /spanning > 1 00 " 

*bp. The average scaffold sizewasil.4 Mbp,^. 
.the average contig size was/23.24 kbpVand ■ 
the average gap size was 2.0 kbp where each 
• distribution : of sizes - was exponential/ As ?. 
jfXsuch, averages tend to.be.underrepresentative ? 
^ of the majority of the data. -Figure 5 shows a 'V 
- histogram, of me bases in scaffolds of various 
,size ranges.: Consider! also; that /more than 
49% of all gaps were '<500 bp long, more . 
than 62% of all gaps were <1 kbp, and all; 
gaps are <100 kbp long. Similarly, more than v- 
73% of the sequencers in contigs >. 30 kbp, > 
: - more than 49% is/in cqntigs. >I 00 kbp, and. \ 
vthe largest contig was 1.99 Mbp long; ITable 3 



Because the Celera data are 5.11X redundant, we estimate that 240 Mbp
of unique Celera sequence is not in the GenBank data set.

In the next step of the CSA process, a combining assembler took the
relevant 5X Celera reads and bactigs for a BAC entry and produced an
assembly of the combined data for that locale. These high-quality
sequence reconstructions were a transient result whose utility was
simply to provide more reliable information for the purposes of their
tiling



by the combining * ^assembler, to, the " subse- 
quent tiling phase. "V v 
, : - ? At this stage, -we typically had one or two 
scaffolds for- every, BAG, region constituting. 



- integers bf overlapping and adjacent scaffold UAot least 95% :of. the wlevant - sequence; and a 

' ''semiencfis in the*, n^vt cten ■ 'Th' VkiUli'na *Uo. ^ ■ «.Un — i:..vf j: • '/iW.f.:. » ' « ., 



sequences in the next step. In outline, the combining assembler first
examines the set of matching Celera reads to determine if there are
excessive pileups indicative of unscreened repetitive elements.
Wherever these occur, reads in the repeat region whose mates have not
been mapped to consistent positions are removed. Then all sets of mate
pairs that consistently imply the same relative position of two
bactigs are bundled into a link and weighted according to the number
of mates in the bundle. A "greedy" strategy then attempts to order the
bactigs by selecting bundles of mate pairs in order of their weight. A
selected mate-pair bundle can tie together two formative scaffolds. It
is incorporated to form a single scaffold only if it is consistent
with the majority of links between contigs of the scaffold. Once
scaffolding is complete, gaps are filled by the "Stones" strategy
described above for the WGA assembler.
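The bundling and greedy selection steps described above can be sketched as follows. This is an illustration under stated assumptions, not the combining assembler itself: each mate pair is reduced to a hypothetical (bactig_a, bactig_b, relation) tuple with the two bactigs always given in the same order, the majority-consistency test is simplified to "skip a bundle whose bactigs are already in the same scaffold," and no coordinates or gap sizes are computed.

```python
from collections import Counter


def greedy_scaffold(bactigs, mate_links):
    """Greedily join bactigs into scaffolds using weighted mate-pair bundles.

    mate_links: iterable of (bactig_a, bactig_b, relation) tuples, one per
    mate pair; `relation` is any hashable label for the implied relative
    placement (hypothetical encoding).
    Returns (joined_bundles, scaffold_of), where scaffold_of maps each
    bactig to a representative bactig of its scaffold.
    """
    # Mates implying the same relative position form one bundle;
    # the bundle weight is the number of mates it contains.
    bundles = Counter(mate_links)
    parent = {b: b for b in bactigs}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    joined = []
    # Heaviest bundles first; ties broken by key for determinism.
    for (a, b, relation), weight in sorted(
        bundles.items(), key=lambda kv: (-kv[1], kv[0])
    ):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b
            joined.append((a, b, relation, weight))
    return joined, {b: find(b) for b in bactigs}
```

Because heavier bundles are applied first, a lightly supported link that contradicts an already accepted, better supported link between the same bactigs is never used.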

The GenBank data for the Phase 1 and 2 BACs consisted of an average of
19.8 bactigs per BAC of average size 8099 bp. Application of the
combining assembler resulted in individual Celera BAC assemblies being
put together into an average of 1.83 scaffolds (median of 1 scaffold)
consisting of an average of 8.57 contigs of average size 18,973 bp. In
addition to defining order and orientation of the sequence fragments,
there were 57% fewer gaps in the combined result. For Phase 0 data,
the average GenBank entry consisted of 91.52 reads of average length
784 bp. Application of the combining assembler resulted in an average
of 54.8 scaffolds consisting of an average of 58.1 contigs of average
size 873 bp. Basically, some small amount of



collection of disjoint Celera-unique scaffolds. The next step in
developing the genome components was to determine the order and
overlap tiling of these BAC and Celera-unique scaffolds across the
genome. For this, we used Celera's 50-kbp mate-pair information,
BAC-end pairs (18), and sequence tagged site (STS) markers to provide
long-range guidance and chromosome separation. Given the relatively
manageable number of scaffolds, we chose not to produce this tiling in
a fully automated manner, but to compute an initial tiling with a good
heuristic and then use human curators to resolve discrepancies or
missed join opportunities. To this end, we developed a graphical user
interface that displayed the graph of tiling overlaps and the evidence
for each. A human curator could then explore the implication of mapped
STS data, dot-plots of sequence overlap, and a visual display of the
mate-pair evidence supporting a given choice. The result of this
process was a collection of "components," where each component was a
tiled set of BAC and Celera-unique scaffolds that had been
curator-approved. The process resulted in 3845 components with an
estimated span of 2.922 Gbp.

In order to generate the final CSA, we assembled each component with
the WGA algorithm. As was done in the WGA process, the bactig data
were shredded into a synthetic 2X shotgun data set in order to give
the assembler the freedom to independently assemble the data. By using
faux reads rather than bactigs, the assembly algorithm could correct
errors in the assembly of bactigs and remove chimeric content in a PFP
data entry.
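The shredding of a bactig into 550-bp faux reads at 2X can be sketched as below. The text does not specify the exact tiling, so this sketch assumes read starts spaced at half the read length, with a final read placed flush against the end of the bactig; the function name is hypothetical.

```python
def shred(bactig, read_len=550, coverage=2):
    """Cut a bactig sequence into overlapping faux reads.

    Assumption: reads start every read_len // coverage bases (275 bp
    for 550-bp reads at 2X), plus one final read ending exactly at the
    end of the bactig. Bactigs no longer than one read are returned
    whole.
    """
    if len(bactig) <= read_len:
        return [bactig]
    step = read_len // coverage
    last_start = len(bactig) - read_len
    starts = list(range(0, last_start + 1, step))
    if starts[-1] != last_start:
        starts.append(last_start)
    return [bactig[s:s + read_len] for s in starts]
```

Under this tiling every interior base is covered by two reads, while the first and last half-read of each bactig are covered once, so the realized depth on a short bactig is somewhat below 2X.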



of this assembly with a direct comparison to
the WGA assembly.

2.5 Comparison of the WGA and CSA 
scaffolds 

Having obtained two assemblies of the human genome via independent
computational processes (WGA and CSA), we compared scaffolds from the
two assemblies as another
means of investigating their completeness, 
consistency, and contiguity. From each as- 
sembly, a set of reference scaffolds contain- 
ing at least 1000 fragments (Celera sequenc- 
ing reads, or bactig shreds) was obtained; this 
amounted to 2218 WGA scaffolds and 1717 
CSA scaffolds, for a total of 2.087 Gbp and 
2.474 Gbp. The sequence of each reference 
scaffold was compared to the sequence of all 
scaffolds from the other assembly with which 
it shared at least 20 fragments or at least 20% 
of the fragments of the smaller scaffold. For 
each such comparison, all matches of at least 
200 bp with at most 2% mismatch were 
tabulated. 
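The selection rules in this paragraph can be expressed compactly. The sketch below assumes, hypothetically, that each scaffold is represented by the set of identifiers of the fragments (Celera reads or bactig shreds) it contains; the function names are illustrative only.

```python
def is_reference(fragments, min_fragments=1000):
    """A scaffold is a reference scaffold if it has at least 1000 fragments."""
    return len(fragments) >= min_fragments


def should_compare(ref_fragments, other_fragments,
                   min_shared=20, min_fraction=0.20):
    """True if two scaffolds from different assemblies share at least 20
    fragments, or at least 20% of the fragments of the smaller scaffold."""
    shared = len(ref_fragments & other_fragments)
    smaller = min(len(ref_fragments), len(other_fragments))
    return shared >= min_shared or (
        smaller > 0 and shared >= min_fraction * smaller
    )


def match_qualifies(match_len, mismatches, min_len=200, max_mismatch=0.02):
    """True if a sequence match is at least 200 bp with at most 2% mismatch."""
    return match_len >= min_len and mismatches <= max_mismatch * match_len
```

The second condition in `should_compare` lets a small scaffold be compared with a large reference scaffold even when fewer than 20 fragments are shared in absolute terms.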

From this tabulation, we estimated the 
amount of unique sequence in each assembly 
in two ways. The first was to determine the 
number of bases of each assembly that were 
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r. -.. 



■^■'In'-orde^tOydeteraiine'the effectiveness of 



noTcovered Vy a'matc^ .... . 

oier assembly: Some 8i;5 Mbp of the WGA v Vpomts better mtemi mapping 
(3'95%)' was not covered by the. CSA, where- ;X more consistent than the ;WGA, because :itV*: r scaffolds, we first examined the reliability of a . 
as*2045 ma P s by- comparison with large scaf-. 

covered by the WGAT.TMs" estimate did not " ; ^'assemblies of megabase-sized problems, a*; folds., Only 1% of the STS markers on the 10 
require : any consistency .of the assemblies or ^ whereas the .WGA is performing a shotgun largest scaffolds ;'(those ^? v ,Mt>p) , : were 
any^iimqueness -of ^ 
^JanoWrn^ 
which 4 niatches-of less ^ 

pair of scaffolds were excluded unless T they- ^formation loss. berween the two is remarkably^ 

- : r '' M - 'lavmg a- i. small; Because CSA was logistic 



some measure „oi wuwsicui . w vm^.^ w*- ^:^ a ^:^^^:^"^r^Y^-^ ,r ^ :~^ ~ -* ■ * ; 
GbpX95.00%) ^me WGA is^cover^ by the ; ^heeded to;bebegun, all; subsequent analysis : 
CSAy and 2.169 Gbp (87.69%) of the CSA is > < ; was performed on.t^s a^semWy.. , - / V ; 

covered by me \VGA b^ this more stringent _/■;• " . ;.. ... - ■ „ - v ; 

2.6 Mapping scaffolds to the genome

measure. v, , r - ; > \* y^'Sirr^;^ 

. .TlieVcomparison ;of WGA ■ to CSA also -r ^The final step.in assembling.the genome was to 



permitted evaluation of scaffolds for structural inconsistencies. We
looked for instances in which a large section of a scaffold from one
assembly matched only one scaffold from the other assembly, but failed
to match over the full length of the overlap implied by the matching
segments. An initial set of candidates was identified automatically,
and then each candidate was inspected by hand. From this process, we
identified 31 instances in which the assemblies appear to disagree in
a nonlocal fashion. These cases are being further evaluated to
determine which assembly is in error and why.

In addition, we evaluated local inconsistencies of order or
orientation. The following results exclude cases in which one contig
in one assembly corresponds to more than one overlapping contig in the
other assembly (as long as the order and orientation of the latter
agrees with the positions they match in the former). Most of these
small rearrangements involved segments on the order of hundreds of
base pairs and rarely >1 kbp. We found a total of 295 kbp (0.012%) in
the CSA assemblies that were locally inconsistent with the WGA
assemblies, whereas 2.108 Mbp (0.11%) in the WGA assembly were
inconsistent with the CSA assembly.
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with fingerprint map placement byinore than 
five BACs. v, When .^furmer. examinmg the V 
. • source of discrepancy, it was found that most ; 
*}'of &t "discrepancy "came 4*pm :4J of , the : 1 0 
r ^afI^a^^mdicat^lg^ & 
^ the .quality of either the map or the;/scaffoldsWV- 
AM four, scaffolds were assembled; as well as " _. ; 
the other six, as judged by clone coverage 
^ analysis,- and showed .the same low discrep- 

; ^ancy -rate ; to .GM99/ and /thus .welconcluded: 

tween the scaffolds.-We next mapped the scafV fethat me fingerprint map global order in these v 
fold groups onto the chromosome using physi- - > cases was not reliable; Smaller scaffolds had 
cal mapping <kta.This step depend on .having; & a higher, discordance rate with GM99 (4.21% . 
reliable high-resolution map information such; u of STSs were : discordant :|by; more :than .five : l ; 
that each scaffold will overlap multiple mark- framework bins), but a lower discordance rate 
ers. There are two genome-wide types of map with the fmgerprint maps ^ : BACs 
information available: high-density STS maps . -disagreed with fmgerprint maps by more man v/ 
and fingerprint maps of BAC clones developed ^ five B ACs). : This observatiori a^ees;^wit^;^e ^ 
J at Washington IMyersity,(^i);jA^ 



.tr.. 



order and orient the scaffolds on the chromosomes. We first grouped
scaffolds together on the basis of their order in the components from
CSA. These grouped scaffolds were reordered by examining residual
mate-pairing data be-



nome-wide STS maps, GeneMap99 (GM99) has the most markers and
therefore was most useful for mapping scaffolds. The two different
mapping approaches are complementary to one another. The fingerprint
maps should have better local order because they were built by
comparison of overlapping BAC clones. On the other hand, GM99 should
have a more reliable long-range order, because the framework markers
were derived from well-validated genetic maps. Both types of maps were
used as a reference for human curation of the components that were the
input to the regional assembly, but they did not determine the order
of sequences produced by the assembler.



XV 
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<30kb 30-50 kb 50-100 kb 100-500 kb 0.5-1 Mb * 1-5 Mb 5-10 Mb > 10Mb 

Scaffold Size 

Fig. 5. Distribution of scaffold sizes of the CSA. For each range of
scaffold sizes, the percent of total sequence is indicated.



fold construction was better supported by long-range mate pairs in
larger scaffolds than in small scaffolds.

We created two orderings of Celera scaffolds on the basis of the
markers (BAC or STS) on these maps. Where the order of scaffolds
agreed between GM99 and the WashU BAC map, we had a high degree of
confidence that that order was correct; these scaffolds were termed
"anchor scaffolds." Only scaffolds with a low overall discrepancy rate
with both maps were considered anchor scaffolds. Scaffolds in GM99
bins were allowed to permute in their order to match WashU ordering,
provided they did not violate their framework orders. Orientation of
individual scaffolds was determined by the presence of multiple mapped
markers with consistent order. Scaffolds with only one marker have
insufficient information to assign orientation. We found 70.1% of the
genome in anchored scaffolds, more than 99% of which are also oriented
(Table 4). Because GM99 is of lower resolution than the WashU map, a
number of scaffolds without STS matches could be ordered relative to
the anchored scaffolds because they included sequence from the same or
adjacent BACs on the WashU map. On the other hand, because of
occasional WashU global ordering discrepancies, a number of scaffolds
determined to be "unmappable" on the WashU map could be ordered
relative to the anchored scaffolds



14 



16 FEBRUARY 2001 VOL 291 SCIENCE www.sciencemag.org



f 



ness ;^ measure the content of an independent 
set of sequence data in the assembly. We com- 
pared 48,938 STS markers from Genemap99 
(51) to the scaffolds. Because these markers




2.7 Assembly and validation analysis

We analyzed the assembly of the genome 
from the perspectives of completeness 
(amount of coverage of the genome) and
> ^^^^ ?f r the ; ;:>we^ 

- order ™<[ <f entatl0n and the ^consensus ^:se£;;^ provided a truly mdependent-measure of com- 
P quence of the. assembly)/;: ^ ; v ; 7 - ^pleteneW ePCR (5^and BLAST^fwere 
r * ; 9 w ^^^;Completeness is defined as^ used to .locate STSs .omthe assbmbled genome. * 
- --'^ • ^ - t >f- , - - tne percentage of the euchromatic sequence '.^ We found 44 524 fQl^ th* <TQc ;„fu 

ioiu> navmg nits trom tne same Gene- known with, absolute certaintv 'until the'eu-^ " rs d^ : ™»£ ^^ 'u :: - ' il- *" ;; :>-^--^l^>. 

^2 *a i • 9 . u- n ' U c . an ^^.How^Art impossible to testimate complete- &STS. markers (2 6%) BoVfoijridm^r Cele^ ( ' 
ass lg ned a placement boundary relate to, ^ 



with GM99. These scaffolds were termed "ordered scaffolds." We found
that 13.9% of the assembly could be ordered by these additional
methods, and thus 84.0% of the genome was ordered unambiguously.

Next, all scaffolds that could be placed, but not ordered, between
anchors were assigned to the interval between the anchored scaffolds
and were deemed to be "bound-


tion information, conflicting information, or could only be assigned
to a generic chromosome location. Using the above approaches, ~98% of
the genome was anchored, ordered, or bounded.

Finally, we assigned a location for each scaffold placed on the
chromosome by spreading out the scaffolds per chromosome. We assumed
that the remaining unmapped scaffolds, constituting 2% of the genome,
were distributed evenly across the genome. By dividing the sum of
unmapped scaffold lengths by the sum of the number of


independent set of random sequences (STS markers) contained in the
assembly. The whole-genome libraries contain heterochromatic sequence
and, although no attempt has been made to assemble it, there may be
instances of unique sequence embedded in regions of heterochromatin as
were observed in Drosophila (50, 51).



93.4% of the human genome, and the unassembled data 5.5%, for a total
of 98.9% coverage. Similarly, we compared CSA against 36,678 TNG
radiation hybrid markers (55a) using the same method. We found that
32,371 markers (88%) were located in the mapped CSA scaffolds, with
5.6% found in the remainder. This gave a 94% coverage of the genome
through another genome-wide survey.

The sequences of human chromosomes 21 and 22 have been completed to
high quality and published (48, 49). Although this sequence served as
input to the assembler, the finished sequence was shredded into a shot-

mapped scaffolds, we arrived at an estimate of interscaffold gap of
1483 bp. This gap was used to separate all the scaffolds on each
chromosome and to assign an offset in the chromosome.
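
The gap arithmetic described here can be sketched in a few lines. This is only an illustration under stated assumptions, not the procedure actually used for the assembly: the function names and the toy lengths are hypothetical, and the only rule taken from the text is that the summed length of the unmapped scaffolds is divided by the number of mapped scaffolds to give one uniform interscaffold gap, which then spaces the scaffolds along a chromosome.

```python
from typing import List


def interscaffold_gap(unmapped_lengths: List[int], n_mapped: int) -> int:
    """Uniform gap = total unmapped length / number of mapped scaffolds."""
    if n_mapped <= 0:
        raise ValueError("need at least one mapped scaffold")
    return sum(unmapped_lengths) // n_mapped


def assign_offsets(scaffold_lengths: List[int], gap: int) -> List[int]:
    """Start offset of each ordered scaffold, separated by a fixed gap."""
    offsets = []
    position = 0
    for length in scaffold_lengths:
        offsets.append(position)
        position += length + gap
    return offsets


# Toy numbers only (hypothetical, not taken from the paper).
gap = interscaffold_gap([3000, 2000, 1000], n_mapped=4)
offsets = assign_offsets([10_000, 5_000, 20_000], gap)
print(gap, offsets)
```

With the real data the same division yields the single 1483-bp gap quoted in the text; the toy values here are chosen only so the arithmetic is easy to follow.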

scale of components (generally multimegabase in size), and so this
comparison reveals

During the scaffold-mapping effort, we encountered many problems that
resulted in additional quality assessment and validation analysis.
At least 978 (3% of 33,173) BACs were



Correctness. Correctness is defined as the structural and sequence
accuracy of the assembly. Because the source sequences for the Celera
data and the GenBank data are from different individuals, we could
not directly compare the consensus sequence of the as-



believed to have sequence data from more than 
one location in the genome (47). This is con- 
sistent with the bactig chimerism analysis re- 
ported above in the Assembly Strategies sec- 
tion. These BACs could not be assigned to
unique positions within the CSA assembly and 
thus could not be used for ordering scaffolds. 
Likewise, it was not always possible to assign 
STSs to unique locations in the assembly be- 
cause of genome duplications, repetitive ele- 
ments, and pseudogenes. 

Because of the time required for an ex- 
haustive search for a perfect overlap, CSA 
generated 21,607 intrascaffold gaps where 
the mate-pair data suggested that the contigs 
should overlap, but no overlap was found. 
These gaps were defined as a fixed 50 bp in 
length and make up 18.6% of the total 
116,442 gaps in the CSA assembly. 

We chose not to use the order of exons 
implied in cDNA or EST data as a way of 
ordering scaffolds. The rationale for not us- 
ing this data was that doing so would have 
biased certain regions of the assembly by 
rearranging scaffolds to fit the transcript data 
and made validation of both the assembly and 
gene definition processes more difficult. 



gun data set so that the assembler had the opportunity to assemble it
differently from the original sequence in the case of structural
polymorphisms or assembly errors in the BAC data. In particular, the
assembler must be able to resolve repetitive elements at the

Fig. 4. Summary of scaffold mapping. Scaffolds were mapped to the
genome with different levels of confidence (anchored scaffolds have
the highest confidence; unmapped scaffolds have the lowest).
Anchored scaffolds were consistently ordered by the WashU BAC map and
GM99. Ordered scaffolds were consistently ordered by at least one of
the following: the WashU BAC map, GM99, or component tiling path.
Bounded scaffolds had order conflicts between at least two of the
external maps, but their placements were adjacent to a neighboring
anchored or ordered scaffold. Unmapped scaffolds had, at most, a
chromosome assignment. The scaffold subcategories are given below
each category.



the level to which the assembler resolves 
repeats. In certain areas, the assembly struc- 
ture differs from the published versions of 
chromosomes 21 and 22 (see below). The 
consequence of the flexibility to assemble
"finished" sequence differently on the basis
of Celera data resulted in an assembly with
more segments than the chromosome 21 and
22 sequences. We examined the reasons why
there are more gaps in the Celera sequence 
than in chromosomes 21 and 22 and expect 
that they may be typical of gaps in other 
regions of the genome. In the Celera assem- 
bly, there are 25 scaffolds, each containing at 
least 10 kb of sequence, that collectively span 
94.3% of chromosome 21. Sixty-two scaf- 
folds span 95.7% of chromosome 22. The 
total length of the gaps remaining in the 
Celera assembly for these two chromosomes 
is 3.4 Mbp. These gap sequences were ana- 
lyzed by RepeatMasker and by searching 
against the entire genome assembly (52). 
About 50% of the gap sequence consisted of 
common repetitive elements identified by Re- 
peatMasker; more than half of the remainder 
was lower copy number repeat elements. 
A more global way of assessing complete- 
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sembly ..against other ^ The 'istan-l"& 5 ^September; 2000! (30> 55b). In this latter 
^determining ^sequencing; accu^ nu^^^d. deviations to be mapped to - 

: : \cleotide level, dmoii^tin^ has beendon^ avoid mapping errors 

identifying . polymorphisms' in Vv length^ with the exceptibn of a fe the only pairs - 

^Section 6. The accuracy' of we* consensus % libraries. The 2VanS^ both reads •"' 

sequence is at least .99.96% oh the basis of a >t ; : tamedlessman2% mvalidmafep at only .one location with less than - 

statistical . estimate .derived from' the ' quality v^as'.the 50-kbp libraries were somewhat higher;;-;; c 6% differences. A threshold was; set such' that V : 
9;yalues>f. me underlyin^lfe 






sequencing reads should be located on the consensus sequence with the
correct separation and orientation between the pairs. A pair is
termed "valid" when the reads are in the correct orientation and the
distance between them is within the mean ± 3 standard deviations of
the distribution of insert sizes of the library from which the pair
was sampled. A pair is termed "misoriented" when the reads are not
correctly oriented, and is termed "misseparated" when the distance
between the reads is not in the correct range but the reads are
correctly oriented. The mean ± the standard deviation of each library
used by the assembler was determined as described above. To validate
these, we examined all reads mapped to the finished sequence of
chromosome 21 (48) and determined how many incorrect mate pairs there
were as a result of laboratory tracking errors and chimerism (two
different segments of the genome cloned into the same plasmid), and
how tight the distribution of insert sizes was for

for validation purposes, especially when several mate pairs confirm
or deny an ordering. The clone coverage of the genome was 39X,
meaning that any given base pair was, on average, contained in 39
clones or, equivalently, spanned by 39 mate-paired reads. Areas of
low clone coverage or areas with a high proportion of invalid mate
pairs would indicate potential assembly problems. We computed the
coverage of each base in the assembly by valid mate pairs (Table 6).
In summary, for scaffolds >30 kbp in length, less than 1% of the
Celera assembly was in regions of less than 3X clone coverage. Thus,
more than 99% of the assembly, including order and orientation, is
strongly supported by this measure alone.
We examined the locations and number of all misoriented and
misseparated mates. In addition to doing this analysis on the CSA
assembly (as of 1 October 2000), we also performed a study of the PFP
assembly as of

sequence (Fig. 6A) serves as validation of this methodology. Blue
tick marks in the panels indicate breakpoints. There was a similar
(small) number of breakpoints on both chromosome sequences. The
exception was 12 sets of scaffolds in the Celera assembly (a total of
3% of the chromosome length in 212 single-contig scaffolds) that were
mapped to the wrong positions because they were too small to be
mapped reliably. Figures 6 and 7 and Table 6 illustrate the mate-pair
differences and breakpoints between the two assemblies. There was a
higher percentage of misoriented and misseparated mate pairs in the
large-insert libraries (50 kbp and BAC ends) than in the small-insert
libraries in both assemblies (Table 6). The large-insert libraries
are more likely to identify discrepancies simply because they span a
larger segment of the genome. The graphic comparison between the two
assemblies (Fig. 6, B and C) shows that there are many
Table 5. Mate-pair validation. Celera fragment sequences were mapped to
the published sequence of chromosome 21. Each mate pair uniquely
mapped was evaluated for correct orientation and placement (number
of mate pairs tested). If the two mates had incorrect relative
orientation or placement, they were considered invalid (number of
invalid mate pairs).
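
The mate-pair test used for validation (correct relative orientation, and a separation within the mean ± 3 standard deviations of the library insert size) can be written as a small classifier. The sketch below is illustrative only: the `Library` class, the function name, and the example library parameters are hypothetical, and it assumes that orientation and distance have already been derived from the read placements.

```python
from dataclasses import dataclass


@dataclass
class Library:
    mean: float  # mean insert size of the library (bp)
    sd: float    # standard deviation of the insert size (bp)


def classify_mate_pair(correctly_oriented: bool, distance: float,
                       lib: Library, n_sd: float = 3.0) -> str:
    """Return 'valid', 'misoriented', or 'misseparated' for one mate pair."""
    if not correctly_oriented:
        return "misoriented"
    low = lib.mean - n_sd * lib.sd
    high = lib.mean + n_sd * lib.sd
    if low <= distance <= high:
        return "valid"
    # Orientation is right, but the separation is outside mean +/- n_sd * sd.
    return "misseparated"


# Hypothetical 10-kbp library with a 1-kbp standard deviation.
lib = Library(mean=10_000, sd=1_000)
print(classify_mate_pair(True, 11_500, lib))   # inside 7,000..13,000 bp
print(classify_mate_pair(True, 14_000, lib))   # outside the allowed range
print(classify_mate_pair(False, 10_000, lib))  # wrong orientation
```

A pair counted as invalid in Table 5 corresponds to either of the two non-valid outcomes here.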
gene boundaries. During this process, multiple hits to the same
region were collapsed to a coherent set of data by tracking the
coverage of a region. For example, if a group of bases was
represented by multiple overlapping ESTs, the union of these regions
matched by the set of ESTs on the scaffold was marked as being
supported by EST evidence. This resulted in a series of "gene bins,"
each of which was believed to contain a single gene. One weakness of
this initial implementation of the algorithm was in predicting gene
boundaries in regions of tandemly duplicated genes. Gene clusters
frequently resulted in homologous neighboring genes being joined
together, resulting in an annotation that artificially concatenated
these gene models.
Next, known genes (those with exact matches of a full-length cDNA
sequence to the genome) were identified, and the region corresponding
to the cDNA was annotated as a predicted transcript. A subset of the
curated human gene set RefSeq from the National Center for
Biotechnology Information (NCBI) was included as a data set searched
in the computational pipeline. If a RefSeq transcript matched the
genome assembly for at least 50% of its length at >92% identity, then
the SIM4 (63) alignment of the RefSeq transcript to the region of the
genome under analysis was promoted to the status of an Otto
annotation. Because the genome sequence has gaps and sequence errors
such as frameshifts, it was not always possible to predict a
transcript that agrees precisely with the experimentally determined
cDNA sequence. A total of 6538 genes in our inventory were
identified, and transcripts predicted in this way.
Regions that have a substantial amount of sequence similarity, but do
not match known genes, were analyzed by that part of the Otto system
that uses the sequence similarity information to predict a
transcript. Here, Otto
Fig. 6. Comparison of the CSA and the PFP assembly.
(A) All of chromosome 21, (B) all of chromosome 8, 
and (C) a 1-Mb region of chromosome 8 representing 
a single Celera scaffold. To generate the figure, Celera 
fragment sequences were mapped onto each assem- 
bly. The PFP assembly is indicated in the upper third 
of each panel; the Celera assembly is indicated in the 
lower third. In the center of the panel, green lines 
show Celera sequences that are in the same order and 
orientation in both assemblies and form the longest 
consistently ordered run of sequences. Yellow lines 
indicate sequence blocks that are in the same orien- 
tation, but out of order. Red lines indicate sequence 
blocks that are not in the same orientation. For 
clarity, in the latter two cases, lines are only drawn 
between segments of matching sequence that are at 
least 50 kbp long. The top and bottom thirds of each 
panel show the extent of Celera mate-pair violations 
(red, misoriented; yellow, incorrect distance between
the mates) for each assembly grouped by library size. 
(Mate pairs that are within the correct distance, as 
expected from the mean library insert size, are omit- 
ted from the figure for clarity.) Predicted breakpoints, 
corresponding to stacks of violated mate pairs of the 
same type, are shown as blue ticks on each assembly 
axis. Runs of more than 10,000 Ns are shown as cyan 
bars. Plots of all 24 chromosomes can be seen in Web
fig. 3 on Science Online at
www.sciencemag.org/cgi/content/full/291/5507/1304/DC1.
bases flanking these regions). The other bases in the region, those
not covered by any homology evidence, were replaced by N's. This
sequence segment, with high confidence regions represented by the
consensus genomic sequence and the remainder represented by N's, was
then evaluated by Genscan to see if a consistent gene model could be
generated. This procedure simplified the gene-prediction task by
first establishing the boundary for the gene (not a strength of most
gene-finding algorithms), and by eliminating regions with no
supporting evidence. If Genscan returned a plausible gene model, it
was further evaluated before being promoted to an "Otto" annotation.
The final Genscan predictions were often quite different from the
prediction that Genscan returned on the same region of native genomic
sequence. A weakness of using Genscan to refine the gene model is the
loss of valid, small exons from the final annotation.
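
The masking step described above, in which bases supported by homology evidence are kept and all other bases are replaced by N's before the segment is passed to a gene finder, can be sketched as follows. This is a minimal illustration under assumptions, not the Otto implementation: the function name, the 0-based end-exclusive interval convention, and the `flank` parameter are hypothetical, and the number of flanking bases actually retained is not legible in the text.

```python
from typing import List, Tuple


def mask_unsupported(seq: str, evidence: List[Tuple[int, int]],
                     flank: int = 0) -> str:
    """Keep bases inside evidence intervals (0-based, end-exclusive),
    widened by `flank` bases on each side; replace all others with 'N'."""
    keep = [False] * len(seq)
    for start, end in evidence:
        lo = max(0, start - flank)
        hi = min(len(seq), end + flank)
        for i in range(lo, hi):
            keep[i] = True
    return "".join(base if k else "N" for base, k in zip(seq, keep))


# Toy example: two evidence intervals with one flanking base each.
masked = mask_unsupported("ACGTACGTACGTACGT", [(2, 5), (10, 12)], flank=1)
print(masked)
```

The masked string has the same length as the input, so coordinates of any gene model predicted on it map directly back to the scaffold.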

The next step in defining gene structures based on sequence
similarity was to compare each predicted transcript with the
homology-based evidence that was used in previous steps to evaluate
the depth of evidence for each exon in the prediction. Internal exons
were considered to be supported if they were covered by homology
evidence to within ±10 bases of their edges. For first and last
exons, the internal edge was required to be within 10 bases, but the
external edge was allowed greater latitude to allow for 5' and 3'
untranslated regions (UTRs). To be retained, a prediction for a
multi-exon gene must have evidence such that the total number of
"hits," as defined above, divided by the number of exons in the
prediction must be >0.66 or must correspond to a RefSeq sequence. A
single-exon gene must be covered by at least three supporting hits
(±10 bases on each side), and these must cover the complete predicted
open reading frame. For a single-exon gene, we also required that
the Genscan prediction include both a start and a stop codon. Gene
models that did not meet these criteria were disregarded, and

Table 7. Sensitivity and specificity of Otto and
Genscan. Sensitivity and specificity were calculated
by first aligning the prediction to the published
RefSeq transcript, tallying the number (N) of
uniquely aligned RefSeq bases. Sensitivity is the
ratio of N to the length of the published RefSeq
transcript. Specificity is the ratio of N to the
length of the prediction. All differences are
significant (Tukey HSD; P < 0.001).
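
The two ratios defined in this caption are simple to compute once N is known. The sketch below assumes that N (the number of uniquely aligned RefSeq bases) has already been obtained from an alignment; the function name and the example lengths are hypothetical and are not values from the paper.

```python
from typing import Tuple


def sensitivity_specificity(n_aligned: int, refseq_len: int,
                            prediction_len: int) -> Tuple[float, float]:
    """Sensitivity = N / RefSeq length; specificity = N / prediction length."""
    if refseq_len <= 0 or prediction_len <= 0:
        raise ValueError("lengths must be positive")
    return n_aligned / refseq_len, n_aligned / prediction_len


# Hypothetical transcript: 1,800 of 2,000 RefSeq bases are recovered
# by a prediction that is 2,250 bases long.
sens, spec = sensitivity_specificity(1800, 2000, 2250)
print(round(sens, 3), round(spec, 3))
```

Read this way, a sensitivity below 1 means RefSeq bases missing from the prediction, and a specificity below 1 means predicted bases that are absent from the RefSeq transcript.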




those that passed were promoted to Otto predictions. Homology-based
predictions do not contain 3' and 5' untranslated sequence.
Gene-finding programs [GRAIL, Genscan, and FgenesH (53)] were run as
part of the computational analysis; the results of these programs
were not directly used in making the Otto predictions.

3.2 Otto validation

To validate the Otto homology-based process and the method that Otto
uses to define the structures of known genes, we compared transcripts
predicted by Otto with their corresponding (and presumably correct)
transcript from a set of 4512 RefSeq transcripts for which there was
a unique SIM4 alignment (Table 7). In order to evaluate the relative
performance of Otto and Genscan, we made three comparisons. The first
involved a determination of the accuracy of gene models predicted by
Otto with only homology data other than the corresponding RefSeq
sequence (Otto homology in Table 7). We measured the sensitivity
(correctly predicted bases divided by the total length of the cDNA)
and specificity (correctly predicted bases divided by the sum of the
correctly and incorrectly predicted bases). Second, we examined the
sensitivity and specificity of the Otto predictions that were made
solely with the RefSeq sequence, which is the process that Otto uses
to annotate known genes (Otto-RefSeq). And third, we determined the
accuracy of the Genscan predictions corresponding to these RefSeq
sequences. As expected, the alignment method (Otto-RefSeq) was the
most accurate, and Otto-homology performed better than Genscan by
both criteria. Thus, 6.1% of the RefSeq nucleotides were not
represented in the Otto-

3.3 Gene number

The Otto system is quite conservative; we used a different
gene-prediction strategy in regions where the homology evidence was
less strong. Here the results of de novo gene predictions were used.
For these genes, we insisted that a predicted transcript have at
least two of the following types of evidence to be included in the
gene set for further analysis: protein, human EST, rodent EST, or
mouse genome fragment matches. This final class of predicted genes is
a subset of the predictions made by the three gene-finding programs
that were used in the computational pipeline. For these, there was
not sufficient sequence similarity information for Otto to attempt to
predict a gene structure. The three de novo gene-finding programs
resulted in about 155,695 predictions, of which ~76,410 were
nonredundant (nonoverlapping with one another). Of these, 57,935 did
not overlap known genes or predictions made by Otto. Only 21,350 of
the gene predictions that did not overlap

Otto predictions were partially supported by at least one type of
sequence similarity evidence, and 8619 were partially supported by
two types of evidence (Table 8).
The sum of this number (21,350) and the number of Otto annotations
(17,764), 39,114, is near the upper limit for the human gene
complement. As seen in Table 8, if the requirement for other
supporting evidence is made more stringent, this number drops rapidly
so that demanding two types of evidence reduces the total gene number
to 26,383 and demanding three types reduces it to ~23,000. Requiring
that a prediction be supported by all four categories of evidence is
too stringent because it would eliminate genes that encode


novel proteins (members of currently unde-

RefSeq annotations, and 2.7% of the nucleotides



Method                 Sensitivity   Specificity
Otto (RefSeq only)*    0.939         0.973
Otto (homology)†       0.604         0.884
Genscan                0.501         0.633



in the Otto-RefSeq transcripts, were not con- 
tained in the original RefSeq transcripts. The 
discrepancies could come from legitimate 
differences between the Celera assembly 
and the RefSeq transcript due to polymor- 
phisms, incomplete or incorrect data in the 
Celera assembly, errors introduced by Sim4 
during the alignment process, or the pres- 
ence of alternatively spliced forms in the 
data set used for the comparisons. 

Because Otto uses an evidence-based approach to reconstruct genes,
the absence of experimental evidence for intervening exons may
inadvertently result in a set of exons that cannot be spliced
together to give rise to a transcript. In such cases, Otto may "split
genes" when in fact all the evidence should be combined into a single
transcript. We also examined



scribed protein families). No correction for 
pseudogenes has been made at this point in 
the analysis. 
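
The gene-number estimates quoted above follow from adding the Otto annotations to the de novo predictions that survive each evidence threshold. The tally below is only a bookkeeping sketch: the counts are taken from the text and from the de novo transcript column of Table 8, while the names `OTTO_ANNOTATIONS`, `DE_NOVO`, and `gene_estimate` are hypothetical.

```python
OTTO_ANNOTATIONS = 17_764  # number of Otto annotations quoted in the text

# De novo predictions that do not overlap Otto predictions, keyed by the
# minimum number of lines of supporting evidence (Table 8).
DE_NOVO = {1: 21_350, 2: 8_619, 3: 4_947}


def gene_estimate(min_lines: int) -> int:
    """Otto annotations plus de novo predictions with >= min_lines of evidence."""
    return OTTO_ANNOTATIONS + DE_NOVO[min_lines]


for k in sorted(DE_NOVO):
    print(k, gene_estimate(k))
```

The first two sums reproduce the 39,114 and 26,383 given in the text, and the third (22,711) is consistent with the rounded figure of ~23,000.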

In a further attempt to identify genes that were not found by the
autoannotation process or any of the de novo gene finders, we
examined regions outside of gene predictions that were similar to the
EST sequence, and where the EST matched the genomic sequence across a
splice junction. After correcting for potential 3' UTRs of predicted
genes, about 2500 such regions remained. Addition of a requirement
for at least one of the following evidence types (homology to mouse
genomic sequence fragments, rodent ESTs, or cDNAs, or similarity to a
known protein) reduced this number to 1010. Adding this to the
numbers from the previous paragraph



*Refers to those annotations produced by Otto using only the
Sim4-polished RefSeq alignment rather than an evidence-based Genscan
prediction. †Refers to those annotations produced by supplying all
available evidence to Genscan.



the tendency of these methods to incorrectly split gene predictions;
the results are shown in Fig. 8. Both RefSeq and homology-based
predictions by Otto split known genes into fewer segments than
Genscan alone.

would give us estimates of about 40,000, 27,000, and 24,000 potential
genes in the human genome, depending on the stringency of evidence
considered. Table 8 illustrates the number of genes and presents the
degree of
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t^"? based ,°" * e su PPPrting evidence, port .the Otto and other predicted transcripts. 4.1 Cytogenetic maps - 

of 

produced 

-.'.tnmcripts from de novo gene-predMon pro- ' : ™vo irusmoOi. . . ., . „ , ;.v20%»f. the tan* otromosome comple. 

• gas Ihu hn « «■.„ of , w<6as W * .4 Oonom. S.ructuri : / CIS""*. * o. xoMiadv. 

completely. computational processes, not ex- ,' ' ? ' ." ' . ' : ' f " *ese were included in the assembly. 

pert curation. We have attempted to enumerate the genes in the human
genome in such a way that we have different levels of confidence
based on the amount of supporting evidence: known genes, genes with
good protein or EST homology evidence, and de novo gene predictions
confirmed by modest homology evidence.

3.4 Features of human gene transcripts

We estimate the average span for a "typical" gene in the human DNA
sequence to be about 27,894 bases. This is based on the average span
covered by RefSeq transcripts, used because it represents our highest
confidence set.

The set of transcripts promoted to gene annotations varies in a
number of ways. As can be seen from Table 8 and Fig. 9, transcripts
predicted by Otto tend to be longer, having on average about 7.8
exons, whereas those promoted from gene-prediction pro-




[Fig. 8 plot: bar chart with x axis "Number of predictions per RefSeq
transcript" (0 to 17) and series Otto (homology), Otto (RefSeq only),
and Genscan.]
Fig. 8. Analysis of split genes resulting from different annotation
methods. A set of 4512 Sim4-based alignments of RefSeq transcripts to
the genomic assembly were chosen (see the text for criteria) and the
numbers of overlapping Genscan, Otto (RefSeq only) annotations based
solely on Sim4-polished RefSeq alignments, and Otto (homology)
annotations (annotations produced by supplying all available evidence
to Genscan) were tallied. These data show the degree to which
multiple Genscan predictions and/or Otto annotations were associated
with a single RefSeq transcript. The zero class for the Otto-homology
predictions shown here indicates that the Otto-homology calls were
made without recourse to the RefSeq transcript, and thus no Otto call
was made because of insufficient evidence.

grams average about 3.7 exons. The largest number of exons that we
have identified in a transcript is 234 in the titin mRNA. Table 8
compares the amounts of evidence that sup-



                             Otto                      De novo
                             Number of    Number of    Number of    Number of
                             transcripts  exons        transcripts  exons
Total                        17,969       141,218      58,032       319,935
Types of evidence
  Mouse                      17,065       111,174      14,463       48,594
  Rodent                     14,881       89,569       5,094        19,344
  Protein                    15,477       108,431      8,043        26,264
  Human                      16,374       118,869      9,220        40,104
No. of lines of evidence*
  ≥1                         17,968†      140,710      21,350       79,148
  ≥2                         17,501       127,955      8,619        31,130
  ≥3                         15,877       99,574       4,947        17,508
  ≥4                         12,451       59,804       1,904        6,520



) were 
fThis 
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■^"euciiriiMc^ r^the ; .gen^^ en- ■ 

K: : ; R-, and T-ban<k (57).^These. cytogenetic bands / ; ~densiry r ih our analysis. In. addition," cnromo-> r^abiedbyja ncarly:coniplete%en6ine sequence is 
% have been presumed to differ in their nucleotide ^vsome'8^ gene .. ..> to; produce ; : ;the; ultimate physical -map, and to 

two other/; .; 
^genome 'i'y. 

'the^ioop 




ige map 

H2, and H3)/;\yhich' are. >300 kbp .in length ;2 y gene/toen we see that 605.Mbp, : or about 20%; ; ;^ toe .'genprne. vlW rate/of recornbination, 'ex- 
(69). Bemardidefoed;^ 

1 G+p-poor.;{^3%), : windows ^shpymSn^ 
/■isoch^ 22^e^^^ 

. resenting 24A8/and 5% of the genome/;£ene^^ ';: .'. 

^:concentratipn has been claimed to very. Io\v^^ ,;v 
in the L isochores and 20-fold more enriched in ■ "•■ somes 4, 13, -18,' arid? have 27,5% of their 492 C \ 

: Mbp in deserts (Table llj.llie apparent lack of ;f lowest rates ..and highest rate's and the largest ; I 
^predicted genes ; m.these;regions : 4qes not nec- ^,;;4iifeence if^4.4> 

: essarily imply that they are devoid of biological '. > k (4.99 to b,'47i bn chr^ indi- ' • 

functioa' ■ : , . . . cates that: the variability. -in recombination 

. . rates .amon 

4.2 Linkage map ; : - ; : : the,4ifferences in recombination : :rates • be-:-/ 

/Linkage maps provide the basis for genetic y : tween . males; and females.; The : human ge- > 
analysis and are widely. used in the study.of the , - ; nome has recombination hotsp6ts,\where re- 




the H2 and H3 isochores (70). By examining contiguous 50-kbp windows
of G+C content across the assembly, we found that regions of G+C
content >48% (H3 isochores) averaged 273.9 kbp in length, those with
G+C content between 43 and 48% (H1+H2 isochores) averaged 202.8 kbp
in length, and the average span of regions with <43% (L isochores)
was 1078.6 kbp. The correlation between G+C content and gene density
was also examined in 50-kbp windows along the assembled sequence
(Table 9 and Figs. 10 and 11). We found that the density of genes was
greater in regions of



will .;depend-iOnAhe.vsi2e;of iej window*;; 



r. «*. 



high G+C than in regions of low G+C content, 
as expected. However, the correlation between 
G+C content and gene density was not as 
skewed as previously predicted (69). A higher 
proportion of genes were located in the G+C- 
poor regions than had been expected. 
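
The windowed G+C analysis described in this section can be sketched as below. This is an illustrative outline rather than the analysis code behind Table 9: the function names are hypothetical, the treatment of windows lying exactly on the 43% and 48% boundaries is an assumption, and the toy sequence uses 20-base windows where the text uses 50-kbp windows.

```python
from typing import List


def gc_percent(window: str) -> float:
    """Percent G+C among the A/C/G/T bases of a window (Ns are ignored)."""
    w = window.upper()
    gc = w.count("G") + w.count("C")
    acgt = gc + w.count("A") + w.count("T")
    return 100.0 * gc / acgt if acgt else 0.0


def isochore_class(gc: float) -> str:
    """Thresholds quoted in the text: >48% H3, 43 to 48% H1/H2, <43% L."""
    if gc > 48.0:
        return "H3"
    if gc >= 43.0:
        return "H1/H2"
    return "L"


def classify_windows(seq: str, window_size: int = 50_000) -> List[str]:
    """Classify contiguous, non-overlapping windows along a sequence."""
    return [isochore_class(gc_percent(seq[i:i + window_size]))
            for i in range(0, len(seq), window_size)]


# Toy sequence: three 20-base windows with 80%, 45%, and 0% G+C.
toy = "GC" * 8 + "AT" * 2 + "GCGCGCGCG" + "ATATATATATA" + "AT" * 10
print(classify_windows(toy, window_size=20))
```

Runs of consecutive windows with the same class could then be merged to obtain region lengths comparable to the averages reported above.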

Chromosomes 17, 19, and 22, which have
a disproportionate number of H3-containing
bands, had the highest gene density (Table
10). Conversely, of the chromosomes that we


Table 9. Characteristics of G+C in isochores.

                       Fraction of genome (%)    Fraction of genes (%)
Isochore   G+C (%)     Predicted*   Observed     Predicted*   Observed
H3         >48         5            9.5          37           24.8
H1/H2      43-48       25           21.2         32           26.6
L          <43         67           69.2         31           48.5

*The predictions were based on Bernardi's definitions (70) of the isochore structure of the human genome.




Fig. 9. Comparison of the number of exons per transcript between the
17,968 Otto transcripts and 21,350 de novo transcript predictions
with at least one line of evidence that do not overlap with an Otto
prediction. Both sets have the highest number of transcripts in the
two-exon category, but the de novo gene predictions are skewed much
more toward smaller transcripts. In the Otto set, 19.7% of the
transcripts have one or two exons, and 5.7% have more than 20. In the
de novo set, 49.3% of the transcripts have one or two exons, and 0.2%
have more than 20.
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1 examined. Unfortunately ton frw - : • : ' • - . 

crossovers h^e occurred in a£ ' ^^ C( *^ *^le ton«ios ^ 

du Polymorphism Humain (CEPH) and other • It is difficult to , ♦ ' •' ported b y othei ^ The last two rows of 

, , finery about.3,1^^ next challenge 

". recombmabon at the^omosomal level/Ah : 
accurate pred^or fbr^therate for ^aria^ 

recombination. .rates t^ee^any/^o^ sland 
markers would be extremely useful in desi^i - B:?°^ n ^Howeyer, we;^ 'and. the first exon^f^^^fe ; -V? , 

ing.m^eWto^ow,^^^ 



Correlation .between CpG islands i^an^^ 
■ arid aonpc v.-*: s\ ■ ^r^?* -^-^available annotation ofchrhmncnmp '99 \»« ; 'ri i-L j ' « ■ t < : T V C ^ UI "puiea ■ tne 



CpG islands are stretches of unmethylated DNA with a higher frequency
of CpG dinucleotides when compared with the entire genome (74). CpG
islands are believed to preferentially occur at the transcriptional
start of genes, and it has been observed that most housekeeping genes
have CpG islands at the 5' end of the transcript (75, 76). In
addition, experimental evidence indicates that CpG island methylation
is correlated with gene inactivation (77) and has been shown to be
important during gene imprinting (78) and tissue-specific gene
expression (79).

Experimental methods have been used that resulted in an estimate of
30,000 to 45,000 CpG islands in the human genome (74, 80) and an
estimate of 499 CpG islands on human chromosome 22 (81). Larsen et
al. (76) and Gardiner-Garden and Frommer (75) used a computational
method to identify CpG islands and defined them as regions of DNA of
>200 bp that have a G+C content of >50% and a ratio of observed



chromosome 22 and after the application of
a higher threshold (method 2) on both data
sets. In sum, genome-wide analysis has
extended earlier analysis and suggests a
strong correlation between CpG islands and
first coding exons.
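
The computational island criteria quoted in this section (a region of more than 200 bp, G+C content above 50%, and a CpG ratio above a threshold of 0.6 for method 1 or 0.8 for method 2) can be sketched as a simple test on a candidate region. The sketch makes two assumptions that should be read as such: it takes the ratio to be the commonly used observed-to-expected form, (number of CpG × length) / (number of C × number of G), which may differ in detail from the likelihood ratio used in the analysis, and the function names are hypothetical.

```python
from typing import Tuple


def cpg_stats(seq: str) -> Tuple[int, float, float]:
    """Return (length, G+C fraction, observed/expected CpG ratio)."""
    s = seq.upper()
    n = len(s)
    c, g = s.count("C"), s.count("G")
    cpg = s.count("CG")
    gc_fraction = (c + g) / n if n else 0.0
    # Assumed form of the ratio: (#CpG * N) / (#C * #G).
    ratio = (cpg * n) / (c * g) if c and g else 0.0
    return n, gc_fraction, ratio


def is_cpg_island(seq: str, min_len: int = 200, min_gc: float = 0.50,
                  min_ratio: float = 0.6) -> bool:
    """Apply the length, G+C, and CpG-ratio thresholds to one region."""
    n, gc_fraction, ratio = cpg_stats(seq)
    return n > min_len and gc_fraction > min_gc and ratio > min_ratio


# Toy 300-bp region: G+C rich, with a CpG ratio of about 0.65.
region = "GGGCCC" * 50
print(is_cpg_island(region, min_ratio=0.6))  # method 1 threshold
print(is_cpg_island(region, min_ratio=0.8))  # method 2 threshold
```

The toy region passes the lower threshold but fails the higher one, which illustrates why the stricter setting yields fewer predicted islands.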

4.4 Genome-wide repetitive elements

The proportion of the genome covered by 



CSA sequence as CpG, but 40% of the gene
starts (start codons) are contained inside a 



[Plot: bar chart of percent of genome and percent of genes in each
G+C content bin (bins of 5%, for example 30-35% and 35-40%).]
The blue bars show the percent of the genome in each range of G+C
content. The percent of the total number of genes associated with
each G+C bin is represented by the yellow bars. The graph shows that
about 5% of the genome has a G+C content of between 50 and 55%, but
that this portion contains nearly 15% of the genes.
potential island if it scores less than the threshold.

To compute various CpG statistics, we used two different thresholds of CG dinucleotide likelihood ratio. Besides using the original threshold of 0.6 (method 1), we used a higher threshold of CG dinucleotide likelihood ratio of 0.8 (method 2), which results in

the number of CpG islands on chromosome 22 close to the number of annotated genes on [...]

various classes of repetitive DNA is presented in Table 14. We [...] CSA sequence [...]

the Celera assembly as a result of incomplete repeat resolution, as discussed above. About 8% of the scaffold length is in gaps, and we expect that much of this is repetitive sequence. Chromosome 19 has the highest repeat density (57%), as well as the highest gene density (Table 10). Of interest, among the different classes of repeat elements, we observe a clear association of Alu elements and gene density, which was not observed between LINEs and gene density.

5 Genome Evolution 

Summary. The dynamic nature of genome evolution can be captured at several levels. These include gene duplications mediated by RNA intermediates (retrotransposition) and segmental genomic duplications. In this section, we document the genome-wide occurrence of retrotransposition events generating functional (intronless paralogs) or inactive genes (pseudogenes). Genes involved in translational processes and nuclear regulation account for nearly 50% of all intronless paralogs and processed pseudogenes detected in our survey. We have also cataloged the extent of segmental genomic duplication and provide evidence for 1077 duplicated blocks covering 3522 distinct genes.

Fig. 11. Genome structural features.

Fig. 11 (continued). Relation among gene density (orange), G+C content (green), EST density (blue), and Alu density (pink) along the lengths of each of the chromosomes. Gene density was calculated in 1-Mbp windows. The percent of G+C nucleotides was calculated in 100-kbp windows. The number of ESTs and Alu elements is shown per 100-kbp window.



5.1 Retrotransposition in the human 
genome 

Retrotransposition of processed mRNA transcripts into the genome results in functional genes, called intronless paralogs, or inactivated genes (pseudogenes). A paralog refers to a gene that appears in more than one copy in a given organism as a result of a duplication event. The existence of both intron-containing and intronless forms of genes encoding functionally similar or identical proteins has been previously described (84, 85). Cataloging these evolutionary events on the genomic landscape is of value in understanding the functional consequences of such gene-duplication events in cellular biology. Identification of conserved intronless paralogs in the mouse or other mammalian genomes should provide the basis for capturing the evolutionary chronology of these transposition events and provide insights into gene loss and accretion in the mammalian radiation.
A set of proteins corresponding to all 901 Otto-predicted, single-exon genes were subjected to BLAST analysis against the proteins
encoded by the remaining multiexon predict- 






pressed. We developed a method for the preliminary analysis of processed pseudogenes in the human genome as a starting point in




5.2 Pseudogenes 

A pseudogene is a nonfunctional copy that is 

ed transcripts. Using homology criteria of 70% sequence identity over 90% of the length, we identified [...] instances of single- to multi-exon correspondence. Of these [...] sequences, 97 were represented in the GenBank data set of experimentally validated full-length genes at the stringency specified and were verified by manual inspection. We believe that these 97 cases may represent intronless paralogs (see Web table 1 on Science Online at www.sciencemag.org/cgi/content/full/291/5507/1304/DC1) of known genes. Most of these are flanked by direct [...] sequences, although [...] to be determined. [...] of the cases for which we have high confidence contain polyadenylated [poly(A)] tails characteristic of retrotransposition.

Recent publications describing the phenomenon of functional intronless paralogs speculate that retrotransposition may serve as a mechanism used to escape X-chromosomal inactivation (84, 86). We do not find a bias toward X chromosome origination of these retrotransposed genes; rather, the results show a random chromosome distribution of both the intron-containing and corresponding intronless paralogs. We also have found several cases of retrotransposition from a single source chromosome to multiple target chromosomes. Interesting examples include the retrotransposition of a five-exon-containing ribosomal protein L21 gene on chromosome 13 onto chromosomes 1, 3, 4, 7, 10, and 14, respectively. The size of the source genes can also show variability. The largest example is the 31-exon diacylglycerol kinase zeta gene on chromosome 11 that has an intronless paralog on chromosome 13. Regardless of route, retrotransposition with subsequent gene changes in coding or noncoding regions that lead to different functions or expression patterns represents a key route to providing an enhanced functional repertoire in mammals (87).

Our preliminary set of retrotransposed intronless paralogs contains a clear overrepresentation of genes involved in translational processes (40% ribosomal proteins and 10% translation elongation factors) and nuclear regulation (HMG nonhistone proteins, 4%), as well as metabolic and regulatory enzymes. EST matches specific to a subset of intronless paralogs suggest expression of these intronless paralogs. Differences in the upstream regulatory sequences between the source genes and their intronless paralogs could account for differences in tissue-specific gene expression. Defining which, if any, of these processed genes are functionally expressed and translated will require further elucidation and experimental validation.

Size of the genome (including gaps)                                  2.91 Gbp
Size of the genome (excluding gaps)                                  2.66 Gbp
Longest contig                                                       1.99 Mbp
Longest scaffold                                                     [...]
Percent of A+T in the genome                                         54
Percent of G+C in the genome                                         [...]
Percent of undetermined bases in the genome                          [...]
Most GC-rich 50 kb                                                   Chr. 2 (66%)
Least GC-rich 50 kb                                                  Chr. X (25%)
[row label illegible]                                                35
Number of annotated genes                                            26,383
Percent of annotated genes with unknown function                     [...]
Number of genes (hypothetical and annotated)                         39,114
Percent of hypothetical and annotated genes with unknown function    59
Gene with the most exons                                             Titin (234 exons)
Average gene size                                                    71 kbp
Most gene-rich chromosome                                            Chr. 19 (23 genes/Mb)
Least gene-rich chromosomes                                          Chr. 13 (5 genes/Mb), Chr. Y (5 genes/Mb)
Total size of gene deserts (>500 kb with no annotated genes)         605 Mbp
Percent of base pairs spanned by genes                               25.5 to 37.8*
Percent of base pairs spanned by exons                               1.1 to 1.4*
Percent of base pairs spanned by introns                             24.4 to 36.4*
Percent of base pairs in intergenic DNA                              74.5 to 63.6*
Chromosome with highest proportion of DNA in annotated exons         Chr. 19 (9.33)
Chromosome with lowest proportion of DNA in annotated exons          Chr. Y (0.36)
Longest intergenic region (between annotated + hypothetical genes)   Chr. 13 (3,038,416 bp)
Rate of SNP variation                                                1/1250 bp
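
The homology criteria quoted in Sections 5.1 and 5.2 (70% sequence identity over 90% of the length for candidate intronless paralogs, and greater than 70% identity over at least 70% of the transcript length for candidate processed pseudogenes) amount to a two-threshold filter on alignment hits. A minimal Python sketch follows; the `Hit` record and its field names are hypothetical and do not correspond to the output format of BLAST or of the authors' pipeline.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One alignment hit (hypothetical record, for illustration only)."""
    query_id: str
    query_length: int     # full length of the query sequence
    aligned_length: int   # number of query positions covered by the alignment
    identities: int       # identical positions within the aligned region


def passes_homology_filter(hit: Hit, min_identity: float = 0.70,
                           min_coverage: float = 0.90) -> bool:
    """Keep a hit only if both identity and length coverage meet the thresholds."""
    if hit.query_length <= 0 or hit.aligned_length <= 0:
        return False
    identity = hit.identities / hit.aligned_length
    coverage = hit.aligned_length / hit.query_length
    return identity >= min_identity and coverage >= min_coverage
```

Calling the function with `min_coverage=0.70` gives a filter in the spirit of the pseudogene screen, although the source states that criterion with a strict "greater than" on identity.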



Chrom.            Male                  Sex-average               Female
          Max.   Avg.   Min.     Max.   Avg.   Min.     Max.   Avg.   Min.
1         2.60   1.12   0.23     2.81   1.42   0.52     3.39   1.76   0.68
2         2.23   0.78   0.33     2.65   1.12   0.54     3.17   1.40   0.61
3         2.55   0.86   0.23     2.40   1.07   0.42     2.71   1.30   0.33
4         1.66   0.67   0.15     2.06   1.04   0.60     2.50   1.40   0.77
5         2.00   0.67   0.18     1.87   1.08   0.42     2.26   1.43   0.62
6         1.97   0.71   0.28     2.57   1.12   0.37     3.47   1.67   0.64
7         2.34   1.16   0.48     1.67   1.17   0.47     2.27   1.21   0.34
8         1.83   0.73   0.14     2.40   1.05   0.46     3.44   1.36   0.43
9         2.01   0.99   0.53     1.95   1.32   0.77     2.63   1.66   0.82
10        3.73   1.03   0.?2     3.05   1.29   0.66     2.84   1.51   0.76
11        1.43   0.72   0.31     2.13   0.99   0.47     3.10   1.32   0.49
12        4.12   0.76   0.26     3.35   1.16   0.49     2.93   1.55   0.59
13        1.60   0.75   0.01     1.87   0.95   0.17     2.49   1.19   0.32
14        3.15   0.98   0.18     2.65   1.30   0.62     3.14   1.63   0.75
15        2.28   0.94   0.34     2.31   1.22   0.42     2.53   1.56   0.54
16        1.83   1.00   0.47     2.70   1.55   0.63     4.99   2.32   1.12
17        3.87   0.87   0.00     3.54   1.35   0.54     4.19   1.83   0.94
18        3.12   1.37   0.86     3.75   1.66   0.43     4.35   2.24   0.72
19        3.02   0.97   0.10     2.57   1.41   0.49     2.89   1.75   0.87
20        3.64   0.89   0.00     2.79   1.50   0.83     3.31   2.15   1.34
21        3.23   1.26   0.69     2.37   1.62   1.08     2.58   1.90   1.18
22        ?.25   ?.10   0.84     1.88   1.41   1.08     3.73   2.08   0.93
X         NA     NA     NA       NA     NA     NA       3.12   1.64   0.72
Y         NA     NA     NA       NA     NA     NA       NA     NA     NA
Genome    4.12   0.88   0.00     3.75   1.22   0.17     4.99   1.55   0.32

that account for gene inactivation. The general structural characteristics of processed pseudogenes include the complete lack of intervening sequences found in the [...] occur as a result of retro[...] unprocessed pseudogenes arise from segmental genome duplication. [...] sequence by means of BLAST. Genomic regions corresponding to all Otto-predicted transcripts were excluded from this analysis. We identified 2909 regions matching with greater than 70% identity over at least 70% of the length of the transcripts that likely represent processed pseudogenes. This number is probably an underestimate because specific methods to search for pseudogenes were not used.

We looked for correlations between structural elements and the propensity for retrotransposition in the human genome. GC content and transcript length were compared between the genes with processed

Transcripts that give rise to processed pseudogenes have shorter average transcript [...] (10%), translation elongation factor alpha (5%) [...] among genes involved in translation and nuclear regulation may reflect an increased transcriptional activity of these genes.

5.3 Gene duplication in the human genome

Building on a previously published procedure (27), we developed a graph-theoretic algorithm, called Lek, for grouping the predicted human protein set into protein families (89).

[...]ing the role of whole-genome or chromosomal duplication in protein family expansion as [...] complements of [...] genome. The variance of each organism [...] to each cluster can then be calculated, allowing an assessment of the relative importance [...] versus smaller-scale, organism-[...] expansion and contraction of protein families, presumably as a result of natural selection operating on individual protein families within an organism. As can be seen in Fig. 12, the large variance in the relative numbers of human as compared with D. melanogaster and

Table 13. Characteristics of CpG islands identified in chromosome 22 (34-Mbp sequence length) and the whole genome (2.9-Gbp sequence length) by means of two different methods. Method 1 uses a CG likelihood ratio of ≥0.6. Method 2 uses a CG likelihood ratio of ≥0.8.

                                                                                    Chromosome 22         Whole genome (CS assembly)
                                                                                    Method 1   Method 2   Method 1   Method 2
Number of CpG islands detected                                                      5,211      522        195,706    26,876
Average length of island (bp)                                                       390        535        395        497
Percent of sequence predicted as CpG                                                5.9        0.8        2.6        0.4
Percent of first exons that overlap a CpG island                                    44         25         42         22
Percent of first exons with first position of exon contained inside a CpG island    37         22         40         21
Average distance between first exon and closest CpG island (bp)                     1,013      10,486     2,182      17,021
Expected distance between first exon and closest CpG island (bp)                    3,262      32,567     7,164      55,811

Table 14. Distribution of repetitive DNA in the compartmentalized shotgun assembly sequence.

Repetitive elements                              Megabases in assembled sequences   Percent of assembly   Previously predicted (%) (83)
Alu                                              288                                9.9                   10.0
Mammalian interspersed repeat (MIR)              66                                 2.3                   1.7
Medium reiteration (MER)                         50                                 1.7                   1.6
Long terminal repeat (LTR)                       155                                5.3                   5.6
Long interspersed nucleotide element (LINE)      466                                16.1                  16.7
Total                                            1025                               35.3                  35.6
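
As a consistency check on Table 14, the percent-of-assembly column can be recomputed from the megabase column. The sketch below assumes an assembly length implied by the Total row (1025 Mb at 35.3%, roughly 2.9 Gbp); that derived length is an inference made for illustration and is not a figure stated in the table.

```python
# Megabases per repeat class, as listed in Table 14.
megabases = {"Alu": 288, "MIR": 66, "MER": 50, "LTR": 155, "LINE": 466}
# Percent of assembly per repeat class, as listed in Table 14.
reported_percent = {"Alu": 9.9, "MIR": 2.3, "MER": 1.7, "LTR": 5.3, "LINE": 16.1}

total_mb = sum(megabases.values())

# Assumption: assembly length back-calculated from the Total row (35.3%).
assembly_mb = total_mb / 0.353

recomputed_percent = {name: 100 * mb / assembly_mb for name, mb in megabases.items()}

for name in megabases:
    print(f"{name:5s} reported {reported_percent[name]:5.1f}  "
          f"recomputed {recomputed_percent[name]:6.2f}")
```

The class totals sum to the 1025 Mb given in the Total row, and each recomputed percentage agrees with the reported value to within rounding of the one-decimal table entries.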



Caenorhabditis elegans proteins in complete clusters may be explained by multiple events of relative expansions in gene families in each of the three animal genomes. Such expansions would give rise to the distribution that shows a peak at 1:1 in the ratio for human-worm or human-fly clusters with the slope spread covering both human and fly/worm predominance, as we observed (Fig. 12). Furthermore, there are nearly as many clusters where worm and fly proteins predominate despite the larger numbers of proteins in the human. At face value, this analysis suggests that natural selection acting on individual protein families has been a major force driving the expansion of at least some elements of the human protein set. However, in our analysis, the difference between an ancient whole-genome duplication followed by loss, versus piecemeal duplication, cannot be easily distinguished. In order to differentiate these scenarios, more extended analyses were performed.

5.4 Large-scale duplications

Using two independent methods, we searched for large-scale duplications in the human genome. First, we describe a protein family-based method that identified highly conserved blocks of duplication. We then describe our comprehensive method for identifying all interchromosomal block duplications. The latter method identified a large number of duplicated chromosomal segments covering parts of all 24 chromosomes.

The first of the methods is based on the idea of searching for blocks of highly conserved homologous proteins that occur in more than one location on the genome. For this comparison, two genes were considered equivalent if their protein products were determined to be in the same family and the same complete Lek cluster (essentially paralogous genes) (89). Initially, each chromosome was represented as a string of genes ordered by the start codons for predicted


filtering methods, a shuffled protein set was first created by taking the 26,588 proteins, randomizing their order, and then partitioning them into 24 shuffled chromosomes, each [...] every [...] appears the same number of times [...]

relative to large-scale duplications. Each gene was indexed according [...] family and Lek complete [...] same complete cluster was given [...] these parameters, 19 conserved interchromo-

tions at several evolutionary stages (94). The figure also illustrates that some chromosomes, such as chromosome 2, contain many more detected large-scale duplications than [...] cated set spans 20 Mbp on chromosome 2 and 63 Mbp on chromosome 14, over 70% of the [...] contains a block duplication that is nearly as [...] shared by chromosome arm 2q and chromosome 12. This duplication incorporates two of the four known Hox gene clusters, but considerably expands the extent

[...] duplications is ancient segmental duplications. In many cases, the order of the proteins has been shuffled, although proximity is preserved. Out of the 1077 blocks, 159 contain only three genes, 137 contain four genes, and 781 contain five or more genes.
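
The shuffled control described in this section (randomizing the order of the 26,588 proteins and partitioning them into 24 shuffled chromosomes) can be sketched as follows. The function name is hypothetical, and the choice to give each shuffled chromosome the same number of genes as the corresponding real chromosome is an assumption, because the partition rule is not legible in the source.

```python
import random


def shuffle_chromosomes(chromosomes, seed=None):
    """Pool all genes, randomize their order, and deal them back out.

    chromosomes: list of lists of gene identifiers, one list per chromosome.
    Returns a new list of lists with the same per-chromosome sizes.
    """
    rng = random.Random(seed)
    pool = [gene for chromosome in chromosomes for gene in chromosome]
    rng.shuffle(pool)

    shuffled = []
    start = 0
    for chromosome in chromosomes:
        end = start + len(chromosome)
        shuffled.append(pool[start:end])
        start = end
    return shuffled
```

Running the same block-detection procedure on such a shuffled set gives an estimate of how many blocks would be reported by chance, which is the purpose of the control.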
To illustrate the extent of the detected duplications, Fig. 13 shows all 1077 block duplications indexed to each chromosome in 24 panels in which only duplications mapped to the indexed chromosome are displayed. The figure makes it clear that the duplications are ubiquitous in the genome. One feature



somal blocks of duplication were observed, all of which were also detected and expanded by the comprehensive method described below. The detection of only a relatively small number of block duplications was a consequence of using an intrinsically conservative method grounded in the conservative constraints of the complete Lek clusters.
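
The block search described in this section (each chromosome represented as a string of genes in positional order, with genes treated as equivalent when they share a family and complete Lek cluster) can be sketched in simplified form. In this hypothetical example each gene is reduced to a cluster identifier, and a candidate block is reported whenever a window on one chromosome shares at least three distinct identifiers with a window on another. The window size, the order-insensitive matching rule, and the function name are all assumptions and are not the authors' parameters.

```python
def candidate_blocks(chrom_a, chrom_b, window=5, min_shared=3):
    """Find window pairs on two chromosomes that share cluster identifiers.

    chrom_a, chrom_b: lists of cluster identifiers in gene order.
    Returns a list of (start_a, start_b, shared_ids) tuples.
    """
    blocks = []
    last_a = max(1, len(chrom_a) - window + 1)
    last_b = max(1, len(chrom_b) - window + 1)
    for i in range(last_a):
        ids_a = set(chrom_a[i:i + window])
        for j in range(last_b):
            # Order inside the window is ignored: gene order may be shuffled
            # while proximity is preserved.
            shared = ids_a & set(chrom_b[j:j + window])
            if len(shared) >= min_shared:
                blocks.append((i, j, sorted(shared)))
    return blocks
```

Overlapping windows report the same underlying block more than once, so a real pipeline would merge adjacent hits and apply further filters before counting blocks.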

In the second, more comprehensive approach, we aligned all chromosomes directly with one another using an algorithm based on the MUMmer system (91). This alignment method uses a suffix tree data structure and a linear-time algorithm to align long sequences very rapidly; for example, two chromosomes of 100 Mbp can be aligned in less than [...] min (on a Compaq Alpha computer) with 4 gigabytes of memory. This procedure was used recently to identify numerous large-scale segmental duplications among the five chromosomes of A. thaliana (92); in that organism, the method revealed that 60% of the genome (66 Mbp) is covered by 24 very large duplicated segments. For Arabidopsis, a DNA-based alignment was sufficient to reveal the segmental duplications between chromosomes; in the human genome, DNA alignments at the whole-chromosome level are insufficiently sensitive. Therefore, a modified procedure was developed and applied, as follows. First, all 26,588 proteins (9,675,713 amino acids) were concatenated end-to-end in order as they occur along each of the 24 chromosomes, irrespective of strand location. The concatenated protein set was then aligned against each chromosome by the MUMmer algorithm. The resulting matches were clustered to extract all sets of three or more protein matches that occur in close proximity on two different chromosomes (93); these represent the candidate segmental duplications. A series of filters were developed and applied to remove likely false-positives from this set; for example, small blocks that were spread across many proteins were removed. To refine the

that it displays is many relatively small chromosomal stretches, with one-to-many duplication relationships that are graphically striking. One such example captured by the analysis is the well-documented olfactory receptor (OR) family, which is scattered in blocks throughout the genome and which has been analyzed for genome-deployment reconstruc-



of the duplications proximally and distally on the pair of chromosome arms. This breadth of duplication is also seen on the two chromosomes carrying the other two Hox clusters. An additional large duplication, between chromosomes 18 and 20, serves as a good example to illustrate some of the features common to many of the other observed large duplications (Fig. 13, inset). This duplication contains [...] pairs of homologous genes. After discounting a 40-Mb stretch of chromosome 18 free of matches to chromosome 20, which is likely to represent a large insert (between the gene assignments "Krup rel" and "collagen rel" on chromosome 18 in Fig. 13), the full duplication segment covers 36 Mb on chromosome 18 and 28 Mb on chromosome 20. By this measure, the duplication segment spans nearly half of each chromosome's net length. The most likely scenario is that the whole span of this region was duplicated as a single very large block, followed by shuffling owing to smaller scale rearrangements. As such, at least four subsequent rearrangements [...] relative insertions and inversions [...] duplicated segment [...] pairs in this alignment [...] protein assignments on chromosomes 18 and [...] large-scale duplication [...] subsequent gene loss on one or [...]. Loss of just one member of a gene pair subsequent to the duplication would result in a failure to score a gene pair in the block; less than 50% gene loss on the chromosomes would lead to the duplication density observed here. As an independent verification of the significance of the alignments detected, it can be seen that a substantial number of the pairs of aligning proteins in this duplication, including some of those annotated (Fig. 13), are those populating small Lek complete clusters (see above). This indicates that they are members of very small families of paralogs; their relative scarcity within the genome validates the uniqueness and robust nature of their alignments.

Fig. 12. Gene duplication in complete protein clusters. The predicted protein sets of human, worm, and fly were subjected to Lek clustering (27). The numbers of clusters with varying ratios (whole number) of human versus worm and human versus fly proteins per cluster were plotted.

Two additional qualitative features were observed among many of the large-scale duplications. First, several proteins with disease associations, with OMIM (Online Mendelian Inheritance in Man) assignments, are members of duplicated segments (see web table 2 on Science Online at www.sciencemag.org/cgi/content/full/291/5507/1304/DC1). We have also observed a few instances where paralogs on both duplicated segments are associated with similar disease conditions. Notable among these genes are proteins involved in hemostasis (coagulation factors) that are associated with bleeding disorders, transcriptional regulators like the homeobox proteins associated with developmental disorders, and potassium channels associated with cardiovascular conduction abnormalities. For each of these disease genes, closer study of the paralogous genes in the duplicated segment may reveal new insights into disease causation, with further investigation needed to determine whether they might be involved in the same or similar genetic diseases. Second, although there is a conserved number of proteins and coding exons predicted for specific large duplicated spans within the chromosome 18 to 20 alignment, the genomic DNA of chromosome 18 in these specific spans is in some cases more than 10-fold longer than the corresponding chromosome 20 DNA. This selective accretion of noncoding DNA (or conversely, loss of noncoding DNA) on one of a [...] observed in many compared regions. Hypothe[...] processes must be tested.

[...] above ([...] to 12, and 18 to [...]) [...] than the [...] duplication regions are to each other. Further, the corresponding mouse chromosomal regions each bear a significant proportion of genes orthologous to the human genes on which the [...]. On the basis of these factors, the corresponding mouse chromosomal spans, at coarse resolution, appear to be products of the same large-[...] though further detailed analysis must be carried out once a more complete genome is assembled for mouse, the underlying large duplications appear to predate the two species' divergence. [...] the latest, before divergence of the primate and rodent lineages. This date can be further refined upon examination of the synteny between human chromosomes and those of chicken, pufferfish (Fugu rubripes), or zebrafish (95). The only substantial syntenic stretches mapped in these species corresponding to both pairs of human duplications are restricted to the Hox cluster regions. When [...] (or others) to human chromosomes is extended with further mapping, the ages of the nearly chromosome-length duplications seen in humans are likely to be dated to the root of vertebrate divergence.

The MUMmer-based results demonstrate large block duplications that range in size from a few genes to segments covering most of a chromosome. The extent of segmental duplications raises the question of whether an ancient whole-genome duplication event is the underlying explanation for the numerous duplicated regions (96). The duplications have undergone many deletions and subsequent rearrangements; these events make it difficult to distinguish between a whole-genome duplication and multiple smaller events. Further analysis, focused especially on comparing the estimated ages of all the block duplications, derived partially from interspecies genome comparisons, will be necessary to determine which of these two hypotheses is more likely. Comparisons of genomes of different vertebrates, and even cross-phyla genome comparisons, will allow for the deconvolution of duplications to eventually reveal the [...] of our genome, and [...] with it a history [...] of many of [...] us from other [...]

[...] between two chromosomes was ~1 per 1200 to [...] that affect the predicted coding regions. This results in an estimate that only thousands, not millions, of genetic variation may contribute to the structural diversity of human proteins.

Having a complete genome sequence enables researchers to achieve a dramatic acceleration in the rate of gene discovery, but only through analysis of sequence variation in DNA can we discover the genetic basis for variation in health among human beings. Whole-genome shotgun sequencing is a particularly effective method for detecting sequence variation in tandem with whole-genome assembly. [...] compared the distribution and attributes of SNPs ascertained by three methods: (i) alignment of the Celera consensus sequence to the PFP assembly, (ii) overlap of high-quality reads of genomic sequence (referred to as "Kwok"; 1,120,195 SNPs) (97), and (iii) reduced representation shotgun sequencing (referred to as "TSC"; 632,640 SNPs) (98). These data were consistent in showing an overall nucleotide diversity of ~8 × 10^-4, marked heterogeneity across the genome in SNP density, and an overwhelming preponderance of noncoding variation that produces no change in expressed proteins.
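
As a rough illustration of the nucleotide diversity figure, the sketch below counts differences between two aligned sequences and divides by the number of sites compared. This follows the definition given in Section 6.3 (the probability that a pair of chromosomes drawn from the population will differ at a nucleotide site). The function name and the decision to skip undetermined bases (N) are assumptions made for the example, not the authors' procedure.

```python
def pairwise_diversity(seq_a: str, seq_b: str) -> float:
    """Fraction of compared sites at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for base_a, base_b in zip(seq_a.upper(), seq_b.upper()):
        if base_a == "N" or base_b == "N":
            continue  # undetermined bases are not counted as surveyed sites
        compared += 1
        if base_a != base_b:
            differences += 1
    return differences / compared if compared else 0.0


# One difference per 1250 bp, the rate of SNP variation listed in the
# genome overview table, expressed as a per-site diversity.
rate_per_site = 1 / 1250
```

The value of `rate_per_site` is 8 × 10^-4, so the rate of one variant per 1250 bp and the overall nucleotide diversity quoted in the text describe the same quantity.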

6.1 SNPs found by aligning the Celera 
consensus to the PFP assembly 
Ideally, methods of SNP discovery make full use of sequence depth and quality at every site, and quantitatively control the rate of false-positive and false-negative calls with an explicit sampling model (99). Comparison of consensus sequences in the absence of these details necessitated a more ad hoc approach (quality scores could not readily be obtained for the PFP assembly). First, all sequence differences between the two consensus sequences were identified; these were then filtered to reduce the contribution of sequencing errors and misassembly. As a measure of the effectiveness of the filtering step, we monitored the ratio of transition and transversion substitutions, because a 2:1 ratio has been well documented as typical in mammalian evolution (100) and in human SNPs [...] Celera consensus was less than 30 and where the density of variants was greater than 5 in 400 bp. These filters resulted in shifting the [...]


[...] These data are not readily available, so we could not estimate nucleotide diversity from the TSC effort. Estimation of nucleotide diversity [...] sequence

6.2 Comparisons to public SNP databases

[...] dbSNP (www.ncbi.nlm.nih.gov/SNP) and [...] from HGMD (Human Gene Mutation Database, from the University of Wales, UK) were mapped on the Celera consensus sequence by a sequence similarity search with the program PowerBlast [...]. The two largest data sets in dbSNP are the Kwok and TSC sets, with 47% and 25% of the dbSNP records. Low-quality alignments with partial coverage of the dbSNP sequence and alignments that had less than 98% sequence identity between the Celera sequence and the dbSNP flanking sequence were eliminated. dbSNP sequences mapping to multiple locations on the Celera genome were discarded. A total of 2,336,935 dbSNP variants were mapped to 1,223,038 unique locations on the Celera sequence, implying considerable redundancy in dbSNP. SNPs in the TSC set mapped to 585,811 unique genomic locations, and SNPs in the Kwok set mapped to 438,032 unique locations. The combined unique SNP count used in this analysis, including Celera-PFP, TSC, and Kwok, is 2,737,668. Table 15 shows that a

[...] nucleotide diversity appeared to vary across [...] X [...] expected to be less variable than autosomes [...] because for every four copies of [...] population, there are only [...] chromosomes, and this smaller ef-

A 2:1 transition:transversion ratio for the [...] bona fide SNPs [...] assumed [...] differences in the Celera [...] (presumably random) sequence errors.
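
The transition-to-transversion check used to monitor the SNP filters can be illustrated with a short sketch. The classification itself is standard (A↔G and C↔T are transitions; every other substitution is a transversion). The function names and the input format, a list of (reference base, variant base) pairs, are hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref: str, alt: str) -> bool:
    """True for purine<->purine or pyrimidine<->pyrimidine substitutions."""
    ref, alt = ref.upper(), alt.upper()
    if ref == alt:
        return False
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES


def ts_tv_ratio(substitutions) -> float:
    """Ratio of transitions to transversions in a list of (ref, alt) pairs."""
    transitions = 0
    transversions = 0
    for ref, alt in substitutions:
        if ref.upper() == alt.upper():
            continue  # not a substitution
        if is_transition(ref, alt):
            transitions += 1
        else:
            transversions += 1
    return transitions / transversions if transversions else float("inf")
```

Of the twelve possible single-base substitutions, four are transitions and eight are transversions, so uniformly random errors would give a ratio near 1:2. A filtered set that moves toward 2:1 is therefore enriched for the pattern documented for real substitutions.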

6.3 Estimation of nucleotide diversity
from ascertained SNPs
The number of SNPs identified varied
widely across chromosomes. In order to
normalize these values to the chromosome
size and sequence coverage, we used the

these methods was Jl^^Zl ' ^o^^ . our,es^ 

^kandCelera-PFRSl^ 

to the iMehvWnf e ^:.: .t-_^ . f . P lu ^nuy .mat ,a pair of chromos omes densely reseouenr^ h^i, ' 

drawn from the population will differ at a

nucleotide site. In order to calculate nucle- 
otide diversity for each chromosome, we 
need to know the number of nucleotide 
sites that were surveyed for variation, and 
in methods like reduced representation se-



--j — 0 — » w *« F w v._ /oj uctween me 
Kwok and Celera-PFP SNPs may be due in part
to the use by Kwok of sequences that went into 
the PFP assembly. The unusually low overlap 
(16.4%) between the Kwok and TSC sets is due 

* • 



effective population size means that
drift will more rapidly remove variation
from the X (106).

Having ascertained nucleotide variation
genome-wide, it appears that previous estimates
of nucleotide diversity in humans
based on samples of genes were reasonably
accurate (101, 102, 106, 107). Genome-wide,
our estimate of nucleotide diversity was



Table 15. Overlap of SNPs from genome-wide
SNP databases. Table entries are SNP counts for
each pair of data sets. Numbers in parentheses are
the fraction of overlap, calculated as the count of
overlapping SNPs divided by the number of SNPs
in the smaller of the two databases compared.
Total SNP counts for the databases are: Celera-PFP,
2,104,820; TSC, 585,811; and Kwok, 438,032.
Only unique SNPs in the TSC and Kwok data sets
were included.
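
The fraction of overlap described in the Table 15 caption is simple arithmetic, and a short sketch makes the definition concrete. The function names and the dictionary layout below are illustrative assumptions; the totals are the database sizes given in the caption, and the pairwise counts are the table entries.

```python
# Illustrative sketch: recompute the "fraction of overlap" from Table 15.
# Function names and data layout are assumptions made for this example.

# Total SNP counts per data set, as stated in the Table 15 caption.
TOTALS = {"Celera-PFP": 2_104_820, "TSC": 585_811, "Kwok": 438_032}

# Pairwise counts of SNPs present in both data sets (Table 15 entries).
OVERLAPS = {
    ("Celera-PFP", "TSC"): 188_694,
    ("Celera-PFP", "Kwok"): 158_532,
    ("TSC", "Kwok"): 72_024,
}


def overlap_fraction(shared, size_a, size_b):
    """Shared SNP count divided by the size of the smaller data set."""
    smaller = min(size_a, size_b)
    if smaller <= 0:
        raise ValueError("data set sizes must be positive")
    return shared / smaller


def overlap_table():
    """Return {(set_a, set_b): fraction rounded to three decimals}."""
    return {
        (a, b): round(overlap_fraction(shared, TOTALS[a], TOTALS[b]), 3)
        for (a, b), shared in OVERLAPS.items()
    }


if __name__ == "__main__":
    for (a, b), fraction in overlap_table().items():
        print(f"{a} vs {b}: {fraction:.3f}")
```

Dividing by the smaller data set means the fraction can reach 1 when the smaller set is entirely contained in the larger one, which is why it is a convenient measure for databases of very different sizes.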



6.4 Variation in nucleotide diversity 
across the human genome 

Such an apparently high degree of variability
among chromosomes in SNP density



ouencina we «pa'h v *u " y among chr Pmosomes. in SNP densi^ 

Table 15 entries (SNP count, with fraction of overlap in parentheses):

                 TSC                Kwok
Celera-PFP       188,694 (0.322)    158,532 (0.362)
TSC                                 72,024 (0.164)

Table 16. Summary of nucleotide changes in different SNP data sets.

SNP data set   A/G (%)   C/T (%)   A/C (%)   [...] (%)   [...] (%)   [...] (%)   Transition:transversion
Celera-PFP     30.7      30.7      10.3      8.6         9.2         10.3        1.59:1
Kwok*          33.7      33.8      8.5       7.0         8.6         8.4         2.07:1
TSC†           33.3      33.4      8.8       7.3         8.6         8.6         1.99:1
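
The transition:transversion ratios in Table 16 follow from the per-class percentages: the two transition classes (A/G and C/T) are summed and divided by the sum of the four transversion classes. The sketch below is an illustration only; the names are assumptions, and because it starts from the rounded percentages rather than raw SNP counts, the result can differ from the tabulated ratio in the last digit.

```python
# Illustrative sketch: recompute transition:transversion ratios from the
# rounded percentages in Table 16. Names are assumptions for this example.

# For each data set: (transition percentages, transversion percentages).
# Transitions are A/G and C/T; the other four classes are transversions.
TABLE_16 = {
    "Celera-PFP": ([30.7, 30.7], [10.3, 8.6, 9.2, 10.3]),
    "Kwok": ([33.7, 33.8], [8.5, 7.0, 8.6, 8.4]),
    "TSC": ([33.3, 33.4], [8.8, 7.3, 8.6, 8.6]),
}


def ts_tv_ratio(transitions, transversions):
    """Sum of transition frequencies over sum of transversion frequencies."""
    total_tv = sum(transversions)
    if total_tv == 0:
        raise ValueError("no transversions observed")
    return sum(transitions) / total_tv


if __name__ == "__main__":
    for name, (ts, tv) in TABLE_16.items():
        print(f"{name}: {ts_tv_ratio(ts, tv):.2f}:1")
```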



^rr. r- . — . 

2000 release of NCBI ibSHP hJ»j3^X^»^^^^^^ l *^ n Univerei *y- . • tNovember 
Fig. 13. Segmental duplications between chromosomes
in the human genome. The 24 panels show
the 1077 duplicated blocks of genes, containing 10,310
pairs of genes in total. Each line represents a pair of
homologous genes belonging to a block; all blocks contain
at least three genes on each of the chromosomes
where they appear. Each panel shows all the
duplications between a single chromosome and
other chromosomes with shared blocks. The chromosome
at the center of each panel is shown as a
thick red line for emphasis. Other chromosomes are
displayed from top to bottom within each panel,
ordered by chromosome number. The inset (bottom,
center right) shows a close-up of one duplication
between chromosomes 18 and 20, expanded
to display the gene names of 12 of the 64
gene pairs shown.
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^ :somes s ^and ; whether U this ^heterogeneity --i^ the total JSNP 

' 'peata ^ihaii^expected J)yi.chan&/ : If : SNPs «^61<^iae;^vershies\?i^ ' Kwok 




fragments of arbitrary .constant 
■^-•served- dispersion inlthe distribution 
f-Sin dOQ^kbp ^fegmenteiwasvfar 

-predicted ;from ;a : .P6issbn t^disffibution : (Fig ■ 





?;'the genbrne;T6puiation genetics theory holds ^ah^riptiorviunit), 5 ' T ;r;f between^ 



;that : we can;accouht fbr/this variation with k^-; UTR^e^nicv(missehseT and ; silent);': inp^rates^^^ 
v rnathematrcal formulation Called ; me^neutral l£;faom^ 
> ^;,W'cc6alescentXi^:^An in in- 

^S^nthins /or. simulating :the ^. heutfal c'oaie^cent#^t^^ human^genes r predicteo!,from V#tro^(V0^ will 

y : -":^lwith recombination (110),: and imng7an\ef-;':^^ 
v^^fective population size of ,1 0,000 ; and >a per-iV^gions, ;SOTs >ver^ 

'y / base recombination rate equal to the mutation . V~ lent, ; ^ for' : ffiose" that^do not :change : amino ^4sbnie ;fraction^ 
. J rate (111) 9 we generated a distribution of num- ^ ■ acid sequence, ■or missense, 1 for those ;that .,; ^- : function as i welL : ".-; 

bersof SOTsby this modelas weU(7^ of . ^r"^;"^^ '- ^'^ ^ 

. • ',/Vobserved distribution of SNPs has a much larg%; missenseitd silent coding - SNPs in' Gelera- ; -r* 7 An Overvi ew of the Pred icted 

, er variance ma^ 

0.78, respectively) shows a markedly reduced
frequency of missense variants compared
with the neutral expectation, consistent
with the elimination by natural selection
of a fraction of the deleterious amino



coalescent model, and the difference is highly
significant. This implies that there is significant
variability across the genome in SNP density,
an observation that begs an explanation.
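
One common way to quantify the excess variability described here is the variance-to-mean ratio of SNP counts per window, which is close to 1 when SNPs fall independently and uniformly (the Poisson case). The sketch below is a minimal illustration of that comparison, not the analysis used in the paper: the function names, the simulated positions, and the use of the dispersion index are all assumptions, and it does not implement the coalescent simulation.

```python
# Illustrative sketch: compare the dispersion of SNP counts per window with
# the Poisson expectation (variance-to-mean ratio near 1).
# All names and the simulated data are assumptions for this example.
import random
from statistics import mean, pvariance


def window_counts(positions, window, length):
    """Count positions falling in consecutive windows of a sequence."""
    n_windows = -(-length // window)  # ceiling division
    counts = [0] * n_windows
    for p in positions:
        if not 0 <= p < length:
            raise ValueError(f"position {p} outside sequence")
        counts[p // window] += 1
    return counts


def dispersion_index(counts):
    """Variance-to-mean ratio; about 1 for Poisson-distributed counts."""
    m = mean(counts)
    if m == 0:
        raise ValueError("mean count is zero")
    return pvariance(counts) / m


def simulate_uniform(n_snps, length, seed=0):
    """Place SNPs uniformly at random (the Poisson-like null model)."""
    rng = random.Random(seed)
    return [rng.randrange(length) for _ in range(n_snps)]


if __name__ == "__main__":
    length, window = 10_000_000, 100_000
    uniform = window_counts(simulate_uniform(2000, length, seed=1), window, length)
    print("uniform placement:", round(dispersion_index(uniform), 2))
    # A strongly clustered pattern: half of the windows hold all the SNPs.
    print("clustered pattern:", dispersion_index([0, 40] * 50))
```

A dispersion index well above 1 for real data, as in the clustered example, is the kind of overdispersion relative to the Poisson curve that Fig. 14 displays.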



Several attributes of the DNA sequence
may affect the local density of SNPs, including
the rate at which DNA polymerase
makes errors and the efficacy of mismatch
repair. One key factor that is likely to be
associated with SNP density is the G+C
content, in part because methylated cytosines
in CpG dinucleotides tend to undergo
deamination to form thymine, accounting
for a nearly 10-fold increase in the
mutation rate of CpGs over other dinucle-

acid changes (112). These ratios are comparable
to the ratios of 0.88 and 1.17 found by
Cargill et al. (101) and by Halushka et al. (102).
Similar results were observed in SNPs derived from
Celera shotgun sequences (46).
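
The sequence features named in the discussion of G+C content and CpG dinucleotides are easy to compute per window before correlating them with SNP density. The sketch below is an illustration under stated assumptions: the function names are hypothetical, and the observed/expected CpG ratio is a commonly used normalization that is not taken from this paper.

```python
# Illustrative sketch: G+C fraction and CpG observed/expected ratio for a
# DNA sequence. The names and the observed/expected normalization are
# assumptions for this example, not the paper's method.


def gc_fraction(seq):
    """Fraction of unambiguous bases (A, C, G, T) that are G or C."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def cpg_count(seq):
    """Number of CG dinucleotides, scanning every adjacent base pair."""
    seq = seq.upper()
    return sum(1 for i in range(len(seq) - 1) if seq[i:i + 2] == "CG")


def cpg_obs_exp(seq):
    """Observed CpG count over the count expected from C and G frequencies."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return cpg_count(seq) * len(seq) / (c * g)


if __name__ == "__main__":
    example = "ACGTTGCACGCGTTAACC"
    print(gc_fraction(example), cpg_count(example), cpg_obs_exp(example))
```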

It is striking how small is the fraction of
SNPs that lead to potentially dysfunctional 
alterations in proteins. In the 10,239 Ref- 
Seq genes, missense SNPs were only about 



0.04 
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Fig. 14. SNP density in each 100-kbp interval as determined with Celera-PFP SNPs. The color codes 
are as follows: black, Celera-PFP SNP density; blue, coalescent model; and red, Poisson distribution. 
The figure shows that the distribution of SNPs along the genome is nonrandom and is not entirely
accounted for by a coalescent model of regional history. 



7 An Overview of the Predicted Protein-Coding Genes in the Human
Genome

Summary. This section provides an initial
computational analysis of the predicted
protein set with the aim of cataloging
prominent differences and similarities
when the human genome is compared with
other fully sequenced eukaryotic genomes.
Over 40% of the predicted protein set in
humans cannot be ascribed a molecular
function by methods that assign proteins to
known families. A protein domain-based
analysis provides a detailed catalog of the
prominent differences in the human genome
when compared with the fly and
worm genomes. Prominent among these are
domain expansions in proteins involved in
developmental regulation and in cellular
processes such as neuronal function, hemostasis,
acquired immune response, and cytoskeletal
complexity. The final enumeration
of protein families and details of protein
structure will rely on additional experimental
work and comprehensive manual
curation.

A preliminary analysis of the predicted hu- 
man protein-coding genes was conducted. 
Two methods were used to analyze and clas- 
sify the molecular functions of 26,588 pre- 
dicted proteins that represent 26,383 gene 
predictions with at least two lines of evidence
as described above. The first method was 
based on an analysis at the level of protein 
families, with both the publicly available 
Pfam database (114,115) and Celera's Pan- 
ther Classification (CPC) (Fig. 15) (116). 
The second method was based on an analysis 
at the level of protein domains, with both the 
Pfam and SMART databases (115, 117). 

The results presented here are preliminary
and are subject to several limitations.
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Both, the gene predictions and functional 
assignments have been made by using computational
tools, although the statistical
models in Panther, Pfam, and SMART have
[...] expert biologists. In the set of computationally
predicted genes, we expect both false-positive
predictions (some of these may in fact be inactive
pseudogenes) and false-negative predictions
[...] errors in
delimiting the boundaries of exons and genes.
Similarly, in the assignments,
we expect both false-positive and
false-negative [...]
assignment protocol focuses on protein families
that tend to be found across several organisms,
or on families of known human genes. Therefore,
we do not assign a function to many genes
[...] even if the function
is known. [...] all
enumeration of the genes in any given family or
functional category was taken from the set of
26,588 predicted proteins, which were assigned
functions by using statistical score cutoffs
defined for models in Panther, Pfam, and
SMART.

For this initial examination of the predicted
human protein set, three broad questions
were asked: (i) What are the likely
molecular functions of the predicted gene
products, and how are these proteins categorized
with current classification methods?
(ii) What are the core functions that
appear to be common across the animals?



v C 



(iii) How does the human protein comple- \/ these unknovvn-function >nes are not real 

!Sn2S ^ * " #tf$£W n(?ed -- ' 8 enes ' Given ■«* ™« of these additional 
y s - : •.•v-;.v:12,095:genes appear to ;be unique among the 

• human proteins vW^^ • 
r - ^v^^n^u^^ molecular'functions are 

•' "g^e 15 shows an overyiew,of ,the,puta-,- 5 '^the.transcription:faaors ind those involved in 
, tive : ^nolecular, fonctions/pf\ ^ predicted, ^mcleic^id m6tabolisrriVnucleic-acid enzyme) ' 
v ? 6 ' 5 ? 8 :human : proteins ; ,that:haye-ata.east'.^ 
,two -lines of suDDOrti'np evMpn'np' Ahniit t,„^„ „^'"_v-_ ' It. ; - v . ..• . /- . 



.>v , -r i ''.^ > r ■ ■'• " - " "fji",™ 1 " AV" 1 " 1 ?^*- ;^oi-.swpnsingiy,. most or the 
not *e ; classifi ed;:frpm^is3mitiaUanalysis^ 

.... and ..are termed .proteins 
■fimcbons. Because. our,autpmatic^classi.f^^ 
.cation methods treat only .relatively ; large : . > toiy molebules": (i) proteins involved in specif- 
protein^famihes, there, are a snumber of ;ic.steps of signal transduction such as hetero-- 
unclassified" sequences ,that^do, "in "iact, - , ;trimeric OTP-binding proteins (G proteins) and i 

^of a r ^ °f P^ di 9 ted Anctionv^ ©proteins that mod- 

; .60 /o of .the protein .set that.Jiaye automatic .^: Tilate.: the activity of kinases, G proteins and 
.functional predictions, ;^spMi^ " 



functions have been placed into broad
classes. We focus here on molecular function
(rather than higher order cellular processes)
in order to classify as many proteins
as possible. These functional predictions
are based on similarity to sequences of
known function.

[...] genomic regions.

Genomic region class | Size of region examined (Mb) | Celera-PFP SNP density (SNP/Mb)


In our analysis of the 12,731 additional low-confidence
predicted genes (those with only
one piece of supporting evidence), only 636
(5%) of these additional putative genes were
assigned molecular functions by the automated
methods. One-third of these 636 predicted
genes represented endogenous retroviral proteins,
further suggesting that the majority



Intergenic , ; 
Gene (intron + - 

exon) 
Intron 

First intron / 
Exon 



2185 
646 

615 
164 
31 



917 

: ■ . j . .. _ 

.. 921 

. O;V,' ; *\808' 



, t — 



Fig. 15 slice labels (number, percentage):

cell adhesion (577, 1.9%)
miscellaneous (1318, 4.3%)
chaperone (159, 0.5%)
viral protein (100, 0.3%)
transfer/carrier protein (203, 0.7%)
nucleic acid enzyme (2308, 7.5%)
signaling molecule (376, 1.2%)
receptor (1543, 5.0%)
kinase (868, 2.8%)
select regulatory molecule (988, 3.2%)
transferase (610, 2.0%)
synthase and synthetase (313, 1.0%)
oxidoreductase (656, 2.1%)
lyase (117, 0.4%)
ligase (56, 0.2%)
isomerase (163, 0.5%)
hydrolase (1227, 4.0%)
cytoskeletal structural protein (876, 2.8%)
extracellular matrix (437, 1.4%)
immunoglobulin (264, 0.9%)
ion channel (406, 1.3%)
motor (376, 1.2%)
structural protein of muscle (296, 1.0%)
protooncogene (902, 2.9%)
select calcium binding protein (34, 0.1%)
intracellular transporter (350, 1.1%)
transporter (533, 1.7%)
molecular function unknown (12809, 41.7%)

Circle labels: GO categories; Panther categories
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Fig. 15. Distribution 
of the molecular 
functions of 26,383 
human genes. Each 
slice lists the numbers
and percentages
(in parentheses) of
human gene functions
assigned to a given
category of molecular
function. The outer circle
shows the assignment
to molecular
function categories in
the Gene Ontology
(GO) (119), and the
inner circle shows
the assignment to
Celera's Panther molecular
function categories
(116).
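
The slice percentages in Fig. 15 can be cross-checked against the slice counts. The sketch below is illustrative only: the total number of assignments is not stated in the text, so it is inferred here from the "molecular function unknown" slice (12809, 41.7%), which is an assumption, as are the function names.

```python
# Illustrative sketch: check Fig. 15 slice percentages against their counts.
# The total is NOT given in the text; it is inferred from the "molecular
# function unknown" slice (12809, 41.7%). Names are assumptions.


def implied_total(count, percent):
    """Total implied by one slice's count and its stated percentage."""
    if percent <= 0:
        raise ValueError("percent must be positive")
    return count / (percent / 100.0)


def percent_of_total(count, total):
    """Percentage of the total, rounded to one decimal as in the figure."""
    return round(100.0 * count / total, 1)


# Inferred total number of function assignments (about 30,700).
TOTAL = implied_total(12809, 41.7)

if __name__ == "__main__":
    for label, count in [("kinase", 868), ("receptor", 1543), ("hydrolase", 1227)]:
        print(label, percent_of_total(count, TOTAL))
```

The same check is a practical way to recover a digit that is illegible in a scanned figure, since each count determines its percentage once the total is known.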
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^P* e ^(^),P^ (720) -deal with ;this case; by; analyzing 

^B^oniecompariscms^ - 
? "^^ e enumer^ed 1^ all of the sequences Jn to 

J ;Lseryed be^^^ looked for parrs' of genes ^Compared vwith c ithe :WHple vhuman : set (Fig. ~? 

^um^-jnfl nearestneighbors in the tree/If ^e|% 

^^^^ ■ vW^U™ -6}? ^g"! JuncfioS ; that nearest-neighbor ^ pairs ; were * from >: different j ± represeuW factor of" = 

^*PP^!i£^ those genes were presumed to be ^ 

v ; The ; concept ofcbrthblogy, ^ y^orthologs.: We mote that these nearest nejgh-;>;^nz^ ma- 7 

: v cause : .if two genes are orthologs, they can be -rbors.can often be confidently identified from / * chinery (no&biy tlWA/RNA rhethyltrans- ! 
traced by descent to the common ancestor of ; .pairwise sequence comparison ;withqut hay--: '^fei&es,"^ % 
the . two, organisms 7 ^ leg->^DNA;Uigas^^^ 
; served protein set"), and therefore are 1^ ' - 

v not from different organisms, ; there . has been 
■-. a paralogous expansion in one or both .organ- 
isms after the speciation event (and/or a gene 
loss by one organism). When this one-to-one 
; correspondence is lost, defining an ortholog 
_ .. . : _ ^ . . • becomes ambiguous. For our . initial compu- .; : . appear to be ; conserved Jamc^^e'anmiajs^^^ 

a duplication ev^ : be^ overview of the predicted human pro^ ; ^^^ 

,7 subsequently diverge, in .^ction, FoUo^g j^tein set; we could not answer this Question, for ed ; (trarisferases^oxidore^ 
: the . yeas>wor^ protein. 'Therefore;^^ 



to perform similar conserved functions in the
different organisms. It is critical in this analysis
to separate orthologs (a gene that appears
in two organisms by descent from a common
ancestor) from paralogs (a gene that appears
in more than one copy in a given organism by



factors, nucleases, and ribosomal proteins).
The basic transcriptional and translational
machinery is well known to have been conserved
over evolution [...]
to the most complex eukaryotes. Many ribonucleoproteins
involved in RNA splicing also


nucleic acid enzyme (221, 12.9%) 



Fig. 16. Functions of putative
orthologs across vertebrate
and invertebrate genomes.
Each slice lists the number and
percentages (in parentheses)
of "strict orthologs" between
the human, fly, and worm genomes
involved in a given category
of molecular function.
"Strict orthologs" are defined
here as bi-directional BLAST
best hits (180) such that each
orthologous pair (i) has a
BLASTP P-value of ≤10^-10
(120), and (ii) has a more significant
BLASTP score than
any paralogs in either organism,
i.e., there has likely been
no duplication subsequent to
speciation that might make
the orthology ambiguous. This
measure is quite strict and is a
lower bound on the number of
[...]
fly orthologs, and 2031 human-worm
orthologs (1523 in
common between these sets).
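
The two criteria in the Fig. 16 caption can be sketched as a small filter over pairwise hit lists. This is a minimal illustration, not the pipeline used in the paper: the tuple layout `(query, subject, p_value, score)`, the function names, and the optional `best_paralog_score` mapping are assumptions introduced for the example, and real BLAST output would need to be parsed into this form first.

```python
# Illustrative sketch of "strict orthologs" as bi-directional best hits.
# Input format, names, and the paralog-score mapping are assumptions.


def best_hits(hits, max_p=1e-10):
    """Map each query to (subject, score) of its top-scoring hit under max_p."""
    best = {}
    for query, subject, p_value, score in hits:
        if p_value > max_p:
            continue  # criterion (i): significance cutoff
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return best


def strict_orthologs(a_vs_b, b_vs_a, best_paralog_score=None, max_p=1e-10):
    """Return sorted (gene_a, gene_b) pairs that are reciprocal best hits
    and score higher than any within-genome paralog hit."""
    paralog = best_paralog_score or {}
    forward = best_hits(a_vs_b, max_p)
    reverse = best_hits(b_vs_a, max_p)
    pairs = []
    for gene_a, (gene_b, score_ab) in forward.items():
        back = reverse.get(gene_b)
        if back is None or back[0] != gene_a:
            continue  # not reciprocal
        # criterion (ii): cross-species hit must beat the best paralog hit
        if score_ab <= paralog.get(gene_a, float("-inf")):
            continue
        if back[1] <= paralog.get(gene_b, float("-inf")):
            continue
        pairs.append((gene_a, gene_b))
    return sorted(pairs)


if __name__ == "__main__":
    human_vs_fly = [("H1", "F1", 1e-50, 200), ("H1", "F2", 1e-20, 90),
                    ("H2", "F2", 1e-30, 120), ("H3", "F1", 1e-5, 40)]
    fly_vs_human = [("F1", "H1", 1e-50, 200), ("F2", "H1", 1e-20, 90),
                    ("F2", "H2", 1e-30, 120)]
    print(strict_orthologs(human_vs_fly, fly_vs_human))
```

Passing a paralog score that exceeds the cross-species score removes the pair, which mirrors the caption's point that a duplication after speciation makes the orthology ambiguous.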



Fig. 16 slice labels (number, percentage):

cytoskeletal structural protein (20, 1.2%)
chaperone (16, 0.9%)
cell adhesion (11, 0.6%)
miscellaneous (72, 4.2%)
viral protein (4, 0.2%)
transfer/carrier protein (11, 0.6%)
transcription factor (81, 4.7%)
extracellular matrix (12, 0.7%)
ion channel (7, 0.4%)
motor (13, 0.8%)
structural protein of muscle (8, 0.5%)
protooncogene (23, 1.3%)
intracellular transporter (51, 3.0%)
transporter (44, 2.6%)
receptor (23, 1.3%)
kinase (69, 4.0%)
transferase (70, 4.1%)
synthase and synthetase (64, 3.7%)
oxidoreductase (64, 3.7%)
lyase (12, 0.7%)
hydrolase (80, 4.7%)
ligase (9, 0.5%)
isomerase (21, 1.2%)
molecular function unknown (613, 35.8%)
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;Wthe largest,^ 
.^eachof.these.three^^ 



context of specific cellular processes that
were likely derived from the last common
ancestor of the human, fly, and worm. As
stated before, this analysis does not provide a
complete estimate of conservation across the
three animal genomes, as paralogous duplication
makes the determination of true orthologs
difficult within the members of conserved
protein families.

7.3 Differences between the human
genome and other sequenced
eukaryotic genomes

To explore the molecular building blocks of
the vertebrate taxon, we have compared the
human genome with the other sequenced
eukaryotic genomes at three levels: molecular
functions, protein families, and protein
domains.

Molecular differences can be correlated
with phenotypic differences to begin to reveal
the developmental and cellular processes that
are unique to the vertebrates. Tables 18 and
19 display a comparison among all sequenced
eukaryotic genomes, over selected protein/
domain families (defined by sequence similarity,
e.g., the serine-threonine protein kinases)
and superfamilies (defined by shared
molecular function, which may include several
sequence-related families, e.g., the cytokines).
In these tables we have focused on
(super) families that are either very large or
that differ significantly in humans compared
with the other sequenced eukaryote genomes.
We have found that the most prominent human
expansions are in proteins involved in (i)
acquired immune functions; (ii) neural development,
structure, and functions; (iii) intercellular
and intracellular signaling pathways

transduction components associated with cytokine
receptor signal transduction are also
features that are poorly represented in the fly
and worm. These include protein domains
found in the signal transducer and activator of
transcription (STATs), the suppressors of cytokine
signaling (SOCS), and protein inhibitors
of activated STATs (PIAS) [...]
many of the animal-specific protein domains
that play a role in innate immune response,
such as the Toll receptors, do not appear to be
significantly expanded in the human genome.

Neural development, structure, and
function. In the human genome, as compared
with the worm and fly genomes, there is a
marked increase in the number of members
of protein families that are involved in
neural development. Examples include neurotrophic
factors such as ependymin, nerve
growth factor, and signaling molecules
such as semaphorins, as well as the number
of proteins involved directly in neural
structure and function such as myelin proteins,
voltage-gated ion channels, and synaptic
proteins such as synaptotagmin.
These observations correlate well with the
known phenotypic differences between the
nervous systems of these taxa, notably (i)
the increase in the number and connectivity
of neurons; (ii) the increase in number of
distinct neural cell types (as many as a
thousand or more in human compared with
a few hundred in fly and worm) (121); (iii)
the increased length of individual axons;
and (iv) the significant increase in glial cell
number, especially the appearance of myelinating
glial cells, which are electrically
inert supporting cells differentiated from
the same stem cells as neurons. A number

Other human expanded gene families play
key roles directly in neural structure and
function. One example is synaptotagmin (expanded
more than twofold in humans relative
to the invertebrates), originally found to regulate
synaptic transmission by serving as a
Ca2+ sensor (or receptor) during synaptic



the increased co-occurrence in humans of
PDZ and the SH3 domains in neuronal-specific
adaptor molecules; examples include
proteins that likely modulate channel activity
at synaptic junctions (128). We also noted
expansions in several ion-channel families
(Table 19), including the EAG subfamily
(related to cyclic nucleotide gated channels),
the voltage-gated [...]
family, the inward-rectifier potassium channel
family, and the voltage-gated potassium
channel, alpha subunit family. Voltage-gated
sodium and potassium channels are involved
in the generation of action potentials in neurons.
Together with voltage-gated calcium
channels, they also play a key role in coupling
action potentials to neurotransmitter release,
in the development of neurites, and in
short-term memory. The recent observation
of a calcium-regulated association between
sodium channels and synaptotagmin may
have consequences for the establishment and
regulation of neuronal excitability (129).

Myelin basic protein and myelin-associated
glycoprotein are major classes of protein
components in both the central and peripheral
nervous system of vertebrates. Myelin P0 is a
major component of peripheral myelin, and
myelin proteolipid and myelin oligodendrocyte
glycoprotein are found in the central
nervous system. Mutations in any of these
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/Accession 
number     Domain name        Domain description

PF02039    Adrenomedullin     Adrenomedullin
PF00212    ANP                Atrial natriuretic peptide
PF00028    Cadherin           Cadherin domain
PF00214    Calc_CGRP_IAPP     Calcitonin/CGRP/IAPP family
PF01110    CNTF               Ciliary neurotrophic factor
PF01093    Clusterin          Clusterin
PF00029    Connexin           Connexin
PF00976    ACTH_domain        Corticotropin ACTH domain
PF00473    CRF                Corticotropin-releasing factor family
PF00007    Cys_knot           Cystine-knot domain
PF00778    DIX                Dix domain
PF00322    Endothelin         Endothelin family
PF00812    Ephrin             Ephrin
PF01404    EPh_lbd            Ephrin receptor ligand binding domain
PF00167    FGF                Fibroblast growth factor
PF01534    Frizzled           Frizzled/Smoothened family membrane region
PF00236    Hormone6           Glycoprotein hormones
PF01153    Glypican           Glypican
PF01271    Granin             Granin (chromogranin or secretogranin)
PF02058    Guanylin           Guanylin precursor
PF00049    Insulin            Insulin/IGF/Relaxin family
PF00219    IGFBP              Insulin-like growth factor binding proteins
PF02024    Leptin             Leptin
PF00193    Xlink              LINK (hyaluron binding)
PF00243    NGF                Nerve growth factor family
PF02158    Neuregulin         Neuregulin family
PF00184    Hormone5           Neurohypophysial hormones
PF02070    NMU                Neuromedin U
PF00066    Notch              Notch (DSL) domain
PF00865    Osteopontin        Osteopontin
PF00159    Hormone3           Pancreatic hormone peptides
PF01279    Parathyroid        Parathyroid hormone family
PF00123    Hormone2           Peptide hormone
PF00341    PDGF               Platelet-derived growth factor (PDGF)
PF01403    Sema               Sema domain
PF01033    Somatomedin_B      Somatomedin B domain
PF00103    Hormone            Somatotropin
PF02208    Sorb               Sorbin homologous domain
PF02404    SCF                Stem cell factor
PF01034    Syndecan           Syndecan domain
PF00020    TNFR_c6            TNFR/NGFR cysteine-rich region
PF00019    TGF-β              Transforming growth factor β-like domain
PF01099    Uteroglobin        Uteroglobin family
PF01160    Opiods_neuropep    Vertebrate endogenous opioids neuropeptide
PF00110    Wnt                Wnt family of developmental signaling proteins

Hemostasis

PF01821    ANATO              Anaphylotoxin-like domain
PF00386    C1q                C1q domain
PF00200    Disintegrin        Disintegrin
PF00754    F5_F8_type_C       F5/8 type C domain
PF01410    COLFI              Fibrillar collagen C-terminal domain
PF00039    Fn1                Fibronectin type I domain
PF00040    Fn2                Fibronectin type II domain
PF00051    Kringle            Kringle domain
PF01823    MACPF              MAC/Perforin domain
PF00354    Pentaxin           Pentaxin family
PF00277    SAA_proteins       Serum amyloid A protein
PF00084    Sushi              Sushi domain (SCR repeat)
PF02210    TSPN               Thrombospondin N-terminal-like domains
PF01108    Tissue_fac         Tissue factor
PF00868    Transglutamin_N    Transglutaminase family
PF00927    Transglutamin_C    Transglutaminase family



100 (550) 
3 

■": ' : xi 

3 

14(16) 

\[- -.-- J 
2 

10(11) 
5 

' 3 

7(8) 
12 

v 23 
9 
1 
14 
■ 3 
1 
7 
10 
1 

t 13(23) 

': : i,:3 
' .'' . 4 
. . T. 
1 

3(5) 
1 
3 

5(9) 
^5 
27(29) 
5(8) 
1 

2 1 
2 
3 

17(31) 
27(28) 
3 
3 
18 

6(14) 
24 

, . .18 , 
15(20) 
.10 

5 (18) : 

11(16) . 
15(24) 
6 
9. 
4 

53(191) 



{•-:. o 

■ a • :0 

;v0 

:^:.: ; ;0 

: 0 

■J ■ i • w 

2 

•: ...V2: 

7 
0 
2 

.:o 

' 4 

0 

/r.r.. 0 

y^. , o ; 
i--v.;o; : 

.0 

0 

/ o 

2(4) 
0 
0 
0 

, 0 

1 

8(10) 
3 
0 

,:o 

0 
1 

... 1 
6 
0 
0 

7(10) 

,0 
0 
2 

5(6) 

,- 0 : 
. 0 
.0. 
2 
0 
.0 

: 0 

11(42) 



0 ^ 



0 

< ■ 0 
^16(66) 
0 
0 
0 

o 

.;--0 
; 0 
0 
• 4 

0 
4 
1 
1 

3 
0 
1 
0 
0 
0 
0 
0 

1 

0 

6 

0 
0 

2(6) 
0 
0 
0 

o 

0 a 
3(4) 
0 
0 
0 
0 

1 

0 
4 
0 
0 



0 
0 
3 
2 
0 

' 0 
0 
2 
0 
0 
0 

8(45) 
0 
0 
0 
0 



~ - - til ^, . 



0/ 
0^ 
0 ? 

0 

o 

o ■ 

o s 

0 

; o 

0 
0 
0 
0 
0 
0 
0 
0 
0 

0 ~ 
0^ 
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0 
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0 
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0 

0 

0 

0 

0 

0 

0 

0 

0 
0 
0 
0 

o 

0 
0 
0 
0 
0 

0 ■ 

0 

0 

0 

0 
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0 
0 
0 

AO 
0 
! 0 
0 

■ * 

: 0 

o 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 
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0 

6 

0 
0 
0 

b v 

0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
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0 
0 
0 
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0 
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Accession " ' . . ■ 

-number; :V*^R?™ ai " "? me - **r^.- ^HjH ^ v>-'F'''' - 

•»;..,/■. ■- : . * : ] 777 r*T — : — ; — — ~- — -r ■ ■ ■ ' * • 

- - n - 

/ > ■ 

PF00748 : ; Calpam.inhib ; ; : .. , Calpain Inhibitor repeat 0/ ... .-• 1/3(01 -.: ^ . 

;^PF00666 ^ 

-DrS ° ^ qj M ,eta * :^ - aass II histocompatibility antigen; ^ domain-- ^ -^ I; 7.;V Hv^ -V,^o - 
PF00879 . , Defensin_propep . .. Defensin propeptide , • . ; . y>,;.... A 3':-;,.^.,; • 

oc^t : . • • Cran y loc yt e -macrpphage colony-stimulating factor r '1^.^^. 0 A 

rnjuw Ig . v--. • -- . Immunoglobulin domain' . c;3«l (930T w'fPQiV' 

, P 00143 . Interferon Interferon alpha/beta domain - V - ^ ! K ■ 

,PF00714 IFN-gamma Interferon gamma "■ . | \ v " 

PF00726 IL10 lnterleukin-10 - \ J ^ 0 

PF02372 IL15 lnterleukin-1 5 ' : ' , \ ° 

:r;„PF0p715 :::;IL2 r r - lnterleukin-2 - ^ : r^-, v,:.^^; . } • ^ -^v^ : , 

. PF00727 - IL4 : ; ; ' \ ; . ' lnterleukin-4 ■ - ' " ^ ~ - ' J ^ : -^^ :>T'T • ; ^ : 

PF02025 IL5; . lnterleukin-5 * l . : : :^V^ , ... >, 0 

PF01415 IL7 . . lnterleukin-7/9 family : , .. : . ... \ ...u., 

PF00340 IL1. lnterleukin-1 . 7 q 

PF02394 ILI^propep : lnterleukin-1 propeptide 1 -. n 

PF02059. IL3. ; . lnterleukin-3 0; 

Dcn?^? 116 lnterleukin-6/G-CSF/MGF family" : '.. ^ Z r ' ' " . 0 ■ 

PF01291 LIF_OSM Leukemia inhibitory factor (LIF)/oncostatin (OSM) 2 0 

family ■ 

PF00323 Defensins Mammalian defensin 2 n . 

PF01091 PTN.MK PTN/MK heparin-binding protein ' . // : .. 2 'I:.:" .V^.q . 

PF002 77 SAA^proteins Serum amyloid A protein 4 ' " 0 

PF00048 IL8 Small cytokines (intecrine/chemokine), 32 . 0 

interleukin-8 like . ' . 

PF01582 TIR v. ■ ; TIR domain 18 / - ^ - 

. PF00229. TNF. TNF (tumor necrosis factor) family : . tj? ,i -. 0 v 

• PF00088 Trefoil : : Trefoil (P-type) domain - '.. V. ^sffi?. ^ ^ 

PF00779 BTKi; - ^ BTK motif : ^ C^Pase signaling 

PF00168 C2 C2 domain . 73f101) 32f44l 

S ^ GKa Diacylglycerol kinase accessory domain (presumed) 9 4 

™ 7 ?; ^GKc Diacylglycerol kinase catalytic domain (presumed) 10 ■•. 8 

PP00610 DEP Domain found in Dishevelled, Egl-10, and 12(13) 4 ' 

Pleckstrin (DEP) . ' 

IS^- ■ — • ^ ^ iG ^ nger ^ " - ' - ^. ^-28(30)^,^14-,. 

PF0099^ XJDf . . GDP dissociation inhibitor 6 2 

PF00503 G-alpha . G-protein alpha subunit . 27(30) .'10 

PF00631 G-gamma G-protein gamma like domains 16 - 5 

PF00616 RasGAP GTPase-artivator protein for Ras-like GTPase 11 - : 5 

PF00618 RasGEFN Guanine nucleotide exchange factor for Ras-like 9 2 

GTPases; N-terminal motif 

PF00625 Cuanylatejdn Guanylate kinase • . 12 8 

189 ,TAM Immunoreceptor tyrosine-based activation motif . 3 0 

PF00169 PH PH domain . , 193(212) 72(78) 

PF00130 DAG_PE-bind Phorbol esters/diacylglycerol binding domain (C1 ■ 45(56) 25 31) 

• domain) : ■ s - 

PF00388 PI-PLC-X Phosphatidylinositol-spedfic phospholipase C,X . 12 3 

domain 

PF00387 PI-PLC-Y .* Phosphatidylinositol-spedfic phospholipase C Y 11 2 

domain 

PF00640 PID Phosphotyrosine Interaction domain (PTB/PID) 24f27) 13 

, PF02192 PI3ie P 85B PI3-kinase family, p85-binding domain 2 1 

I™ 0 ™ 4 . . p,3 K-rbd PI3-kinase family, ras-binding domain 6 3 

PF01412 ArfGAP Putative GTP-ase activating protein for Arf 16 9 

PF02196 RBD RaMike Ras-binding domain . 6f7) 4 

PF02145 Rap.GAP Rap/ran-GAP - } ^ - 4 - • 

n^?I? 8 ^ Ras association (RalGDS/AF-6) domain 18(19) 7f9) 

. PF00071 Ras , < Ras family t V26 - ' 56 f57 - • 

- PF00617 RasGEF . . RasGEF domain ; . 21 ^ 8 - 

1 Dcn^ilv' nn S ' '- - Regulator of G protein signaling domain L. ... . ^27 L". - 6(7) " s - 

PFOZ197,, RHa . Regulatory subunit of type II PKA R-subunit ' ' 4'. 1 . * 
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--<r 0 
' 0 

67 (323) 
0 
0 
0 

...... 0 

O 0 
0 

v 0 
0 
0 
0 

. ' 0 

0 
0 

0 
0 
0 
0 

' -' . 2 

0 

2 
0 

24 (35) 
7 
8 
10 

— is„, 

1 

20 (23) 
5 
8 
3 

7 
0 

65(68) 
26(40) 



11(12) 
1 
1 
8 
1 
2 
6 
51 
7 

12(13) 
2 



J .0 

^0' 

0 

^ 0 
- 0 
0 
0 

V 0 
0 

.0 

Ci 0 
; 0 

0 

0 
0 



0 
0 

b 

"0 

- 0 
0 

6(9) 
0 
2 
5 

~w 5 
1 
2 
1 
3 
5 

1 
0 
24 
1(2) 



0 
0 
0 
6 
0 
0 
1 
23 
5 
1 
1 



0 

;o 
0 
0 
0 
0 
,0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 

0 
0 
0 



131 (143) 
0 

V 0 
0 

66 (90) 
6 

11(12) 

. 2 

— 15 
3 
5 
0 
0 
0 

4 
0 
23 
4 

8 

8 

0 
0 
0 

15 
0 
0 
0 

78 

*: 0 

♦ * * 

0 
0 
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Accession 
-number 



; aa/Doiriain name 3 W^^HVDorta!n«pti^ 



- w 




^pfooszo 

, PF00621 
: PF00536 
PF01369 
y: PF00017 
:V,PF06018 
PF01017 



RhoGAP 
RhoGEF 
SAM " 
Sec7 

r SH3 
STAT 



RhoGAP domain '. 
^RhoGEF^domain ^ . ...... 

i/SAM.domain (StenUe alpha nriotiO 
', i Sec7 domain 3- ^ J": ^ - 
^ Src homology 2 (SH2) domain 



S?9 (31) 
■ ' 13 
87(95} 



, ; ,19 

23 (24) 
-15 
5 

33(39) 



PF00790 , , . VHS 
PF00568 WH 1 





1 f il \ 

> PF00452 
. . ; PF02180 
PF00619 
- : PF00531 v 
r PF01335 
/PF02179 
. ' PF00656 
•"• PF00653 

'i ( 

, PF00022 

PF00191 
.: PF00402 
PF00373 
PF00880 
. PF00681 
PF00435 
PF00418 
PF00992 
PF02209 , 
PF01044 

" PF01391 
PF01413 

PF00431 
PF00008 
PF00147 

PF00041 : 
PF00757 
PF00357 
PF00362 
PF00052 . 
PF00053 
PF00054 
PF00055 
PF00059 
PF01463 
PF01462 
PF00057 . 
PF00058 
PF00530 
PF00084 
PF00090 
PF00092 . . 
PF00093 
PF00094 

PF00244 . 
PF00O23 
PF00514 ~ : ' 
PF00168 
PF00027 
PF01556 
PF00226 
PF00036 
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myelin proteins result in severe demyelina- tion'(five myelin PO, three myelin proteolip- :.~ Intercellular and Intracellular signaling 
Tw? V J*? 10 1°*? V COnd,tl0n " " id - myelin basic protein, and myelin-oligo- pathways in development and homeostasis 
which the myelin is lost and the nerve con- = ; 'dendrocyte glycoprotein; or/MOG); arid pps-,; , iMany protein, families that have expanded in 

,,ducaon is severely, impaired («0),Humans , . sibIy : :mpre-remotely : related members :of the humans relative to ; the : m^erVebrates are in- 

.,ha^^ ; kast ; aO,genes.^longing.^ 

r : different famihes involved m myelin produc- ^proteolipidV and worms have' none at ^.-^diis^ito^iap^:^ differentiation 
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FKBP-type pepticryl-prblyl cis-trans jsomerases 
^ GAF domain ^ - 
; : Kelch motif ^: \j 
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PAS domain ' v -'y 

PDZ domain (Also known as DHR or GLGF) 
PH domain 

PPR repeat t • ■ 
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Sec7 domain" "*y ■ 1 r''''-~ : 

Src homology 2 (SH2) domain
Src homology 3 (SH3) domain
STAS domain
TPR domain
WD40 domain
WW domain
ZZ-Zinc finger present in dystrophin, CBP/p300

Nuclear interaction domains

A20-like zinc finger
ARID DNA binding domain
BAH domain
B-box zinc finger
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Myb-like DNA-binding domain 
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RFX DNA-binding domain 
RNA recognition motif (a.k.a. RRM, RBD, or RNP 
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SAP domain - y , ; 
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START domain 
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•-Table 18 {Continued) 

x ^ - * 

Accession' 



•j-ji^v- j- ....... 1 1 -. „ . • — ' . '• - 

■*■.■ i Accession •~~?~- r - ■ -'^ .?*■ .•'■.-->-. i- ,•■■« . - v ... * (*f * -« ■■.-•*-«... -■ -.' 

/ 1 ^..n^rnber;: ! .X;;.^^,^^ A-y :vVv >'v-; v.ppmain-description I-". :^y;r.m < jvr v H 



ij PF02135 ^ <-Zf«tAZ ; v : ,TAZ finger 

PF01285 , TEA ^ : TEA domain 

s " PFp21 76 ;; .^':vZf-TRAF -syte V ^ ^TRAMwe zinc^n eer 

Z PF00352 ' TBP . : .* : . . JJX ^r^^i^'J r.^'J? 



r * !'.■.*■■*. .."•>'iW ■ ■ ■»-■"■ v * - * ■ 
- 2(3 lr:- v Vi: ^ (2),;,;i * 6(7) ,0 ; ? V 



10(15) 



PF06'567 : /^ TUDOR 
PF00642 -V Z^CCH ; 
" PF00096 Zf-C2H2** V 
%. PF00097M$ Zf-C3HC4 j X 
, : PF00098 : ;'ZNCCHC ' 



-Transcription factor T 
^vprotein;TB^ 



• . , v i-.jtS- v ^^-:7-^-!- : ' W-.^ A.-YIM V^VVi^ -.U .-'10 (15) 



- TUDOR rfftmain ; ■ •:■ ---^i-irlt ^^-rCVt^-t,-!. ...... -f ■. -v^.. •• - ; i--;. ■ 

. ,i uu»ui\ aomain .-. . - : - - , , j2^A : ■ , .:, ■ q q\ : ; v « >^ /(-V : * • ' a ;i r 

: ^!"" fl^C-x^xS^-xS-H^ype (and simila^ X : c" ' 17 (22 '.^ v:>6 (8 ^^ZZte ^> 3 W ^ ; ^ ^31 




^^ 5 i!^ ^ C / 5 x ^ ( 89 ) i: ■ 18 SJi298 (304) 
••^• 9 0- 7 ) • : : ;>6(10) . v ; 17(33) ^7 (13) 9^:68 (91) 




(Tables 18 and 19). They include secreted hormones and growth factors, receptors, intracellular signaling molecules, and transcription factors.

Developmental signaling molecules that are enriched in the human genome include growth factors such as wnt, transforming growth factor-β (TGF-β), fibroblast growth factor (FGF), nerve growth factor, platelet derived growth factor (PDGF), and ephrins. These growth factors affect tissue differentiation and a wide range of cellular processes involving actin-cytoskeletal and nuclear regulation. The corresponding receptors of these developmental ligands are also expanded in humans. For example, our analysis suggests at least 8 human ephrin genes (2 in the fly, 4 in the worm) and 12 ephrin receptors (2 in the fly, 1 in the worm). In the wnt signaling pathway, we find 18 wnt family genes (6 in the fly, 5 in the worm) and 12 frizzled receptors (6 in the fly, 5 in the worm). The Groucho family of transcriptional corepressors downstream in the wnt pathway is even more markedly expanded, with 13 predicted members in humans (2 in the fly, 1 in the worm).
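
The family counts quoted in this paragraph can be turned into fold expansions with a few lines of arithmetic. The sketch below is only illustrative: the dictionary layout and the function name `fold_expansion` are assumptions introduced here, and the counts are copied from the paragraph above rather than recomputed from any data set.

```python
# Gene-family counts (human, fly, worm) as quoted in the text above.
FAMILY_COUNTS = {
    "ephrin": (8, 2, 4),
    "ephrin receptor": (12, 2, 1),
    "wnt": (18, 6, 5),
    "frizzled": (12, 6, 5),
    "Groucho": (13, 2, 1),
}


def fold_expansion(counts):
    """Return (human/fly, human/worm) ratios for one gene family."""
    human, fly, worm = counts
    return human / fly, human / worm


for family, counts in FAMILY_COUNTS.items():
    vs_fly, vs_worm = fold_expansion(counts)
    print(f"{family:16s} {vs_fly:4.1f}x vs fly  {vs_worm:4.1f}x vs worm")
```

On these numbers the Groucho family shows the largest ratio against both invertebrates, which matches the "even more markedly expanded" wording in the text.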

Extracellular adhesion molecules involved in signaling are expanded in the human genome (Tables 18 and 19). The interactions of several of these adhesion domains with extracellular matrix proteoglycans play a critical role in host defense, morphogenesis, and tissue repair (131). Consistent with the well-defined role of heparan sulfate proteoglycans in modulating these interactions (132), we observe an expansion of the heparin sulfate sulfotransferases in the human genome relative to worm and fly. These sulfotransferases modulate tissue differentiation (133). A similar expansion in humans is noted in structural proteins that constitute the actin-cytoskeletal architecture. Compared with the fly and worm, we observe an explosive expansion of the nebulin (35 domains per protein on average), aggrecan (12 domains per protein on average), and plectin (5 domains per protein on average) repeats in humans. These repeats are present in proteins involved in modulating the actin-cytoskeleton with predominant expression in neuronal, muscle, and vascular tissues.



^oti^ 

^^eavprotem farmhes and dormms;mvolved^^, genomes of 
, ; ^In,particular,,s|pa 

^playing iples m developmental regulation and^Pand ^ ^oma^ 

factors compared with the unicellular eukaryotes, and their repertoire is limited to the expansion of the yeast-specific C6 transcription factor family involved in metabolic regulation.

enriched. There is a factor of 2 or greater expansion in humans in the Ras superfamily GTPases and the GTPase activator and GTP exchange factors associated with them. Although there are about the same number of tyrosine kinases in the human and C. elegans genomes, in humans there is an increase in the SH2, PTB, and ITAM domains involved in phosphotyrosine signal transduction. Further, there is a twofold expansion of phosphodiesterases in the human genome compared with either the worm or fly genomes.

While we have illustrated expansions in a subset of signal transduction molecules in the human genome compared with the other eukaryotic genomes, it should be noted that most of the protein domains are highly conserved. An interesting observation is that worms and humans have approximately the same number of both tyrosine kinases and serine/threonine kinases (Table 19). It is important to note, however, that these are merely counts of the catalytic domain; the proteins that contain these domains also display a wide repertoire of interaction domains with significant combinatorial diversity.

The downstream effectors of the intracellular signaling molecules include the transcription factors that transduce developmental fates. Significant expansions are noted in the ligand-binding nuclear hormone receptor class of transcription factors compared with the fly genome, although not to the extent observed in the worm (Tables 18 and 19). Perhaps the most striking expansion in humans is in the C2H2 zinc finger transcription factors. Pfam detects a total of 4500 C2H2 zinc finger domains in 564 human proteins, compared with 771 in 234 fly proteins. This means that there has been a dramatic expansion not only in the number of C2H2 transcription factors, but also in the number of these DNA-binding motifs per transcription factor (8 on average in humans, 3.3 on average in the fly, and 2.3 on average in the worm). Furthermore, many of these transcription factors contain either the KRAB or SCAN domains, which are not found in the fly or worm genomes. These domains are involved in the oligomerization of transcription factors and increase the combinatorial partnering of these factors. In general, most of the transcription factor domains are shared between the three animal genomes, but the reassortment of these domains results in organism-specific transcription factor families. The domain combinations found in the human, fly, and worm include the BTB with C2H2 in the fly and humans, and
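
The per-protein averages quoted for the C2H2 zinc fingers follow directly from the Pfam totals given in the text. A minimal check, assuming only the four counts stated above (the helper name `domains_per_protein` is introduced here for illustration; the worm totals are not given in this passage, so the worm average is not recomputed):

```python
def domains_per_protein(domain_count, protein_count):
    """Average number of domain copies per protein that contains the domain."""
    return domain_count / protein_count


human = domains_per_protein(4500, 564)  # C2H2 domains / human proteins
fly = domains_per_protein(771, 234)     # C2H2 domains / fly proteins
print(f"human: {human:.1f}, fly: {fly:.1f}")  # prints "human: 8.0, fly: 3.3"
```

Both values agree with the averages of 8 and 3.3 reported in the paragraph.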



Hemostasis. Hemostasis is regulated primarily by plasma proteases of the coagulation pathway and by the interactions that occur between the vascular endothelium and platelets. Consistent with known anatomical and physiological differences between vertebrates and invertebrates, extracellular adhesion domains that constitute proteins integral to hemostasis are expanded in the human relative to the fly and worm (Tables 18 and 19). We note the evolution of domains such as FIMAC, FN1, FN2, and C1q that mediate surface interactions between hematopoietic cells and the vascular matrix. In addition, there has been extensive recruitment of more-ancient animal-specific domains such as VWA, VWC, VWD, kringle, and FN3 into multidomain proteins that are involved in hemostatic regulation. Although we do not find a large expansion in the total number of serine proteases, this enzymatic domain has been specifically recruited into several of these multidomain proteins for proteolytic regulation in the vascular compartment. These are represented in plasma proteins that belong to the kinin and complement pathways. There is a
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metaUoprotease) and MMPs fmatrix metals ^ «rs. Ctoe of tbe most ? may account for : this apparent expansion. 

proteA^lS),^ 

hilar n*trk(ECM) proteir* is critical for ' mM ™ 

development ana for tissue de^dationmdis-im 

ease, and a variety- of iMammatory conditions :many,retrotrans- • >to.haye other functions. It has a second cat- 



vascular matrix ,co^ 
have been shown to cleave matrix proteins, and even signaling molecules: ADAM-17 converts tumor necrosis factor-α, and ADAM-10 has been implicated in the Notch signaling pathway (135). We have identified 19 members of the matrix metalloprotease family, and a total of 51 members of the ADAM and ADAM-TS families.
Apoptosis. Evolutionary conservation of some of the apoptotic pathway components across eukarya is consistent with its central role in developmental regulation and as a response to pathogens and stress signals. The signal transduction pathways involved in programmed cell death, or apoptosis, are mediated by interactions between well-characterized domains that include extracellular domains, adaptor (protein-protein interaction) domains, and those found in effector and regulatory enzymes (137). We enumerated the protein counts of central adaptor and effector enzyme domains that are found only in the apoptotic pathways to provide an estimate of divergence across eukarya and relative expansion in the human genome when compared with the fly and worm (Table 18). Adaptor domains found in proteins restricted only to apoptotic regulation, such as the DED domains, are vertebrate-specific, whereas others like BIR, CARD, and Bcl2 are represented in the fly and worm (although the number of Bcl2 family members in humans is significantly expanded). Although plants and yeast lack the caspases, caspase-like molecules, namely the para- and meta-caspases, have been reported in these organisms (138). Compared with other animal genomes, the human genome shows an expansion in the adaptor and effector domain-containing proteins involved in apoptosis, as well as in the proteases involved in the cascade such as the caspase and calpain families.

Expansions of other protein families.
Metabolic enzymes. There are fewer cytochrome P450 genes in humans than in either the fly or worm. Lipoxygenases (six in humans), on the other hand, appear to be specific to the vertebrates and plants, whereas the lipoxygenase-activating proteins (four in humans) may be vertebrate-specific. Lipoxygenases are involved in arachidonic acid metabolism, and they and their activators have been implicated
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lane receptors. {149). Although there is no 1 
significant numerical increase in the counts for domains involved in nuclear protein modification, there are a number of domain arrangements in the predicted human proteins that are not found in the other currently sequenced genomes. These include the tandem association of two histone deacetylase domains in HD6 with a ubiquitin finger domain, a feature lacking in the fly genome. An additional example is the co-occurrence of the important nuclear regulatory enzyme PARP (poly-ADP ribosyl transferase) domain fused to the protein-interaction domains BRCT and VWA in humans.

Concluding remarks. There are several possible explanations for the differences in phenotypic complexity observed in humans when compared to the fly and worm. Some of these relate to the prominent differences in the immune system, hemostasis, neuronal, vascular, and cytoskeletal complexity. The finding that the human genome contains fewer genes than previously predicted might be compensated for by combinatorial diversity generated at the levels of protein architecture, transcriptional and translational control, posttranslational modification of proteins, or posttranscriptional regulation. Extensive domain shuffling to increase or alter combinatorial diversity can provide an exponential



:! 



^uclepprotems.JvTO 
• f :^ es : ;% nurnr^ 

> -in vthe ^^rm,;:twpitimes that of the fly, and 
^^about^me ; ^e^as /^he A^dentified in ;the : . J ] 

of ribonucleoprotein in humans contributes to gene regulation at either the splicing or translational level is unknown.

Posttranslational modifications. In this set of processes, the most prominent expansion is the transglutaminases, calcium-dependent enzymes that catalyze the cross-linking of proteins in cellular processes such as hemostasis and apoptosis (147). The vitamin K-dependent gamma carboxylase gene product acts on the GLA domain (missing in the fly and worm) found in coagulation factors, osteocalcin, and matrix GLA protein (148). Tyrosylprotein sulfotransferases participate in the posttranslational modification of proteins involved in inflammation and hemosta-
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increase in the ability to mediate protein- 
protein interactions without dramatically in- 
creasing the absolute size of the protein com- , - 

, ;plement (150), Evolution of apparentl^newv^^ ^- r- ,,,, : ■ • - ; ^ ^.^ ; - ,-;,.„■ 

v ;(fr( f! ^ perspective of sequence- 
,protem,domams^ 

^^comp exity by domain accretion both quantki^ ^ ; v , -J v ^S.v>:, - ; ^ ^ 7 ) : " \ i ^ \ ' ^ ' 
tatively and qualitatively (recruitment of nov^^^v^^- > ; '^nr^^ ^v. t^b -:-7 kMl^^,^?;^ .'-0 ' 

; el dnmninc .wi'tli ^ -. " : e 1 i-reiated . . ,25 ^4 r .; - g ^-^-^ v> -jq* j \ ;:>>' : r.Q 
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zmc' flnger^onta Histbne 
.where we~see-exp^ ;^Hist6ne;H2A 

^domains 'Per- : pix)t^- toiefe>^th^^' v;: ^? ne "? B 
>;brate-spec^ 

SCAN. Recent reports on the prominent use of internal ribosomal entry sites in the human genome to regulate translation of specific classes of proteins suggest that this is an area that needs further research to identify the full extent of this process in the human genome (151). At the posttranslational level, although we provide examples of expansions of some protein families involved in these modifications, further experimental evidence is required to evaluate whether this is correlated with increased complexity in protein processing. Posttranscriptional processing and the extent of isoform generation in the human remain to be cataloged in their entirety. Given the conserved nature of the spliceosomal machinery, further analysis will be required to dissect regulation at this level.

8 Conclusions 
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ABD-B 
Bithoraxoid 
: t Iroquois class 
c Distal-less „ _ . ... 
-7 Engrailed ^ 
LIM-containing 
: MEIS/KNOX class 
• NK-3/NK-2 class 
I Paired box 
Six 

Leucine zipper 
Nuclear hormone receptorf 
Pou-related 
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f 8.1 The whole-genome sequencing 
approach versus BAC by BAC 

Experience in applying the whole-genome shotgun sequencing approach to a diverse group of organisms with a wide range of genome sizes and repeat content allows us to assess its strengths and weaknesses. With the success of the method for a large number of microbial genomes, Drosophila, and now the human, there can be no doubt concerning the utility of this method. The large number of microbial genomes that have been sequenced by this method (75, 80, 152) demonstrates that megabase-sized genomes can be sequenced efficiently without any input other than the de novo mate-paired sequences. With more complex genomes like those of Drosophila or human, map information, in the form of well-ordered markers, has been critical for long-range ordering of scaffolds. For joining scaffolds into chromosomes, the quality of the map (in terms of the order of the markers) is more important than the number of markers per se. Although this mapping could have been performed concurrently with sequencing, the prior existence of mapping data was beneficial. During the sequencing of the A. thaliana genome, sequencing of individual BAC clones permitted extension of the se-
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; ' r ™r"™r-™^ of mRNA i'e 

;;: : Iiyer,:^ 

; unique regions of the genome. . 
;:v. size, and more importantly the r : 

S ,,:ThecostandoveM : ^ 

...clone^ premise, and on, the basis of v^activity^^^ 

a.^one ^tegy.ibr available -mutation iate^ bylnSof^SnW 

: genome-se^encmg projects Specific B^^mat^^f^ loci, Mer^i 967;^ is unlikel/to be eielyLces^raK : 



i 7 T; j ~~. " , — .^^uu^^muic uiau ^uyu genes j J/.,An estimate oi content, vCpG: islands/ aid Rene's' iSS* { 
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quencing phase.. Our experience with human 
genome assembly suggests that this will require at least 3X coverage of both whole-genome and BAC shotgun sequence data.
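
As a rough guide to what a given fold coverage buys, the sketch below uses the standard idealized Poisson model of shotgun sequencing, in which reads land uniformly at random and the expected fraction of bases hit by at least one read is 1 - e^(-c) for coverage c. This is an illustration under strong assumptions (no repeats, no cloning bias, no assembly step), not the method used in the paper; the function name is hypothetical, and the 5.11-fold and eightfold figures are the coverages mentioned in the paper's abstract.

```python
import math


def expected_fraction_covered(coverage):
    """Idealized Poisson model: fraction of bases hit by at least one read."""
    return 1.0 - math.exp(-coverage)


for c in (3.0, 5.11, 8.0):
    fraction = expected_fraction_covered(c)
    print(f"{c:5.2f}x coverage -> {fraction:.4f} of bases sampled at least once")
```

Under this model 3X coverage already samples roughly 95% of bases, which is why the practical limit in real genomes tends to be repeat resolution and long-range ordering rather than raw sampling depth.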

8.2 The low gene number in humans 

We have sequenced and assembled ~95% of the euchromatic sequence of H. sapiens and used a new automated gene prediction meth-



the theoretical maximum gene number were based on simplified ideas of genetic load: that all genes have a certain low rate of mutation to a deleterious state. However, it is clear that many mouse, fly, worm, and yeast knockout mutations lead to almost no discernible phenotypic perturbations.

The modest number of human genes



genome than previously thought (about 9%), and are the most gene-dense fraction, but contain only 25% of the genes, rather than the predicted ~40%. The low G+C L isochores make up 65% of the genome, and 48% of the genes. This inhomogeneity, the net result of millions of years of mammalian gene duplication, has been described as the "desertification" of the vertebrate genome
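
The inhomogeneity described here can be expressed as a relative gene density, that is, a fraction's share of the genes divided by its share of the genome. The sketch below uses only the percentages quoted above; it assumes that the "about 9%" fraction holding 25% of the genes is the G+C-rich class (the start of that sentence is missing from this copy), and the function name is introduced here for illustration.

```python
def relative_gene_density(gene_share, genome_share):
    """Share of genes divided by share of genome; 1.0 means average density."""
    return gene_share / genome_share


# Assumed to be the G+C-rich fraction: ~9% of the genome, 25% of the genes.
gc_rich = relative_gene_density(0.25, 0.09)
# Low G+C L isochores: 65% of the genome, 48% of the genes.
gc_poor = relative_gene_density(0.48, 0.65)

print(f"gene-dense fraction: {gc_rich:.2f}x average density")
print(f"low G+C L isochores: {gc_poor:.2f}x average density")
print(f"ratio between the two: {gc_rich / gc_poor:.1f}")
```

On these figures the gene-dense fraction carries genes at nearly three times the genome-wide average, while the L isochores sit below average, consistent with the "desert" description.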



: i!n^^- P ^ ^ !f ^ ° f ^ ^™ c ^ sm > • * at generate - the; complexities, . gene ^ensity^ and iare :^esef accidehts of his- 
numan genes. This has provided a major sur- v„ inherent in human develonment hnH th* • t rtW u,v " .* j " „; « „ 



human genes. This has provided a major surprise: We have found far fewer genes (26,000 to 38,000) than the earlier molecular predictions (50,000 to over 140,000). Whatever the reasons for this current disparity, only detailed annotation, comparative genomics (particularly using the Mus musculus genome), and careful molecular dissection of complex phenotypes will clarify this critical issue of the basic "parts list" of our genome. Certainly, the analysis is still incomplete and considerable refinement will occur in the years to come as the precise structure of each transcription unit is evaluated. A good place to start is to determine why the gene estimates derived from EST data are so discordant with our predictions. It is likely that the following contribute to an inflated gene number derived from ESTs: the variable lengths of 3'- and 5'-untranslated leaders and trailers; the little-understood vagaries of RNA processing that often leave intronic regions in an unspliced condition; the finding that nearly 40% of human genes are alternatively spliced (153); and finally, the unsolved technical problems in EST library construction where contamination from heterogeneous nuclear RNA and genomic DNA is not uncommon. Of course, it is possible that there are genes that remain unpredicted owing to the absence of EST or protein data to support them, although our use of mouse genome data for



inherent in human development and the sophisticated signaling systems that maintain homeostasis. There are a large number of ways in which the functions of individual genes and gene products are regulated. The degree of "openness" of chromatin structure and hence transcriptional activity is regulated by protein complexes that involve histone and DNA enzymatic modifications. We enumerate many of the proteins that are likely involved in nuclear regulation in Table 19. The location, timing, and quantity of transcription are intimately linked to nuclear signal transduction events as well as to the tissue-specific expression of many of these proteins. Equally important are regulatory DNA elements that include insulators, repeats, and endogenous viruses (157); methylation of CpG islands in imprinting (158); and promoter-enhancer and intronic regions that modulate transcription. The spliceosomal machinery consists of multisubunit proteins (Table 19) as well as structural and catalytic RNA elements (159) that regulate transcript structure through alternative start and termination sites and splicing. Hence, there is a need to study different classes of RNA molecules (160) such as small nucleolar RNAs, antisense riboregulator RNA, RNA involved in X-dosage compensation, and other structural RNAs to appreciate their precise role in regulating gene expression. The phenomenon

tory or driven by selection and evolution? If these deserts are dispensable, it ought to be possible to find mammalian genomes that are far smaller in size than the human genome. Indeed, many species of bats have genome sizes that are much smaller than that of humans; for example, one species of Italian bat has a genome size that is only 50% that of humans (164). Similarly, Muntiacus, a species of Asian barking deer, has a genome size that is ~70% that of humans.

8.3 Human DNA sequence variation
and its distribution across the genome

This is the first eukaryotic genome in which a nearly uniform ascertainment of polymorphism has been completed. Although we have identified and mapped more than 3 million SNPs, this by no means implies that the task of finding and cataloging SNPs is complete. These represent only a fraction of the SNPs present in the human population as a whole. Nevertheless, this first glimpse at genome-wide variation has revealed strong inhomogeneities in the distribution of SNPs across the genome. Polymorphism in DNA carries with it a snapshot of the past operation of population genetic forces, including mutation, migration, selection, and genetic drift. The availability of a dense array of SNPs will allow questions related to each of these factors to be addressed on a genome-wide basis. SNP studies can establish the range of haplo-
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types present in subjects of different ethnogeo-,; then docks on this, and then the! complex 
graphic origins, providing insights into popula- moves there. .,. - {167) to the exciting area 
toon history and migration patterns.. Although ,, .,,of. : network :-iperturbatio'ns, ; nonlinear /.re- 

SUCU StlldieS HP VP Cliff 0<»CtpH tVi -at mrt/1«vm ltn mn _ - - < * * , , ' .\ 



8.5 Beyond single components 

While few would disagree with the intuitive 
.conclusion?^ was more 

tcompJex-M com- , 

rparisons siich 'as -whether tie set of predict^fck 

SNP.maps^wiU be needed to settle these con- ^£ systems, jtie ther gene number,: neuron iiumber- ^degree; ^^o^trai^tfbrkid^ince orotein ' 

troversies- m addinon to providmg evidence for ^ 
'.populanon^^ admix r . 
, ture, SNPs ;can serve as .markers for the 'extent -. '.' si 
.ofeyph^onarycons^ 

^W^e^ecially "^mato^iden-^ 

,„ V u t. T • • v h ^ M »¥ 1 4from.comparatoVe mammalian neu- . vic .approach .to me analysis . of biological svs- ' 

gions may.have lower SNP density because : ; ylar neuroanatomies.:. Fdrs example, ;^yhen We^or^is^fhrbugh^ The de- 

JiSST, w". IT 56 r 1 ° f Py^ynarmoset (which ds only 4 ^ ? men* of me sySen^ be^seked^ t ^ 

™1Z?J5 S It? T °? ° f ^ ^V^--*"**** less than that of a chimp , , - can self-organize, but more important, they ci , 
SJwft . ( i ^' m f ° f rand T gC - ^^ less than.that of humans. not 

netic drift also vanes widely across the ge-,- the neuroanatomies of aU three brains are strik- • " ' 

nome. The nonrecombining portion of the Y : : ; -mgly .sM 

chromosome faces the strongest pressure from random drift because there are roughly one-quarter as many Y chromosomes in the population as there are autosomal chromosomes, and the level of polymorphism on the Y is correspondingly less. Similarly, the X chromosome has a smaller effective population size than the autosomes, and its nucleotide diversity is also reduced. But even across a single autosome, the effective population size can vary because the density of deleterious mutations may vary. Regions of high density of deleterious mutations will see a greater rate of elimination by selection, and the effective population size will be smaller (166). As a result, the density of even completely neutral SNPs will be lower in such regions. There is a large literature on the association between SNP density and local recombination rates in Drosophila, and it remains an important task to assess the strength of this association in the human genome, because of its impact on the design of local SNP densities for disease-association studies. It also remains an important task to validate SNPs on a genomic scale in order to assess the degree of heterogeneity among geographic and ethnic populations.

8.4 Genome complexity

We will soon be in a position to move away from the cataloging of individual components of the system, and beyond the simplistic notions of "this binds to that, which

of the pygmy marmoset are little different from those of chimpanzees. Between humans and chimpanzees, the gene number, gene structures and functions, chromosomal and genomic organizations, and cell types and neuroanatomies are almost indistinguishable, yet the developmental modifications that predisposed human lineages to cortical expansion and development of the larynx, giving rise to language, culminated in a massive singularity that by even the simplest of criteria made humans more complex in a behavioral sense.

Simple examination of the number of neurons, cell types, or genes or of the genome size does not alone account for the differences in complexity that we observe. Rather, it is the interactions within and among these sets that result in such great variation. In addition, it is possible that there are "special cases" of regulatory gene networks that have a disproportionate effect on the overall system. We have presented several examples of "regulatory genes" that are significantly increased in the human genome compared with the fly and worm. These include extracellular ligands and their cognate receptors (e.g., wnt, frizzled, TGF-β, ephrins, and connexins), as well as nuclear regulators (e.g., the KRAB and homeodomain transcription factor families), where a few proteins control broad developmental processes. The answers to these "complexities" perhaps lie in these expanded gene families and differences in the regulatory control of ancient genes, proteins, pathways, and cells.

due to redundancy, but is a property of inhomogeneously wired networks. The error tolerance of such networks comes with a price; they are vulnerable to the selection or removal of a few nodes that contribute disproportionately to network stability. Gene knockouts provide an illustration. Some knockouts may have minor effects, whereas others have catastrophic effects on the system. In the case of vimentin, a supposedly critical component of the cytoplasmic intermediate filament network of mammals, the knockout of the gene in mice reveals them to be reproductively normal, with no obvious phenotypic effects (172), and yet the usually conspicuous vimentin network is completely absent. On the other hand, ~30% of knockouts in Drosophila and mice correspond to critical nodes whose reduction in gene product, or total elimination, causes the network to crash most of the time, although even in some of these cases, phenotypic normalcy ensues, given the appropriate genetic background. Thus, there are no "good" genes or "bad" genes, but only networks that exist at various levels and at different connectivities, and at different states of sensitivity to perturbation. Sophisticated mathematical analysis needs to be constantly evaluated against hard biological data sets that specifically address network dynamics. Nowhere is this more critical than in attempts to come to grips with "complexity," particularly because deconvoluting and correcting complex networks that have undergone perturbation, and have resulted in human diseases, is the greatest significant challenge now facing us.
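
The contrast drawn above between tolerated and catastrophic knockouts can be illustrated with a toy simulation. The sketch below is not taken from the paper: it assumes a preferential-attachment graph as a stand-in for an "inhomogeneously wired" network, treats node removal as a knockout, and uses the size of the largest connected component as a crude proxy for network integrity. All function names and parameter values are hypothetical choices made for this example.

```python
import random
from collections import deque


def preferential_attachment_graph(n, m, rng):
    """Grow a graph in which each new node links to m existing nodes,
    chosen with probability proportional to their current degree."""
    adj = {i: set() for i in range(n)}
    endpoints = []  # every node appears once per edge it touches
    for i in range(m + 1):  # small fully connected starting core
        for j in range(i + 1, m + 1):
            adj[i].add(j)
            adj[j].add(i)
            endpoints += [i, j]
    for new in range(m + 1, n):
        chosen = set()
        while len(chosen) < m:
            chosen.add(rng.choice(endpoints))
        for old in sorted(chosen):
            adj[new].add(old)
            adj[old].add(new)
            endpoints += [new, old]
    return adj


def largest_component(adj, removed):
    """Size of the largest connected component once `removed` nodes are gone."""
    seen = set(removed)
    best = 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            node = queue.popleft()
            size += 1
            for neighbor in adj[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        best = max(best, size)
    return best


def knockout_experiment(n=400, m=2, knockouts=60, seed=1):
    """Compare random knockouts with knockouts of the best-connected nodes."""
    rng = random.Random(seed)
    adj = preferential_attachment_graph(n, m, rng)
    random_nodes = rng.sample(range(n), knockouts)
    hub_nodes = sorted(adj, key=lambda v: len(adj[v]), reverse=True)[:knockouts]
    return largest_component(adj, random_nodes), largest_component(adj, hub_nodes)


after_random, after_targeted = knockout_experiment()
print("largest component after random knockouts:  ", after_random)
print("largest component after knocking out hubs: ", after_targeted)
```

In this toy setting, removing randomly chosen nodes leaves most of the network connected, whereas removing the same number of highly connected nodes fragments it far more, which is the qualitative behavior the text attributes to inhomogeneously wired networks.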

It has been predicted for the last 15 years 
that complete sequencing of the human genome 
would open up new strategies for human 
biological research and would have a major 
impact on medicine, and through medicine and 
public health, on society. Effects on 
biomedical research are already being felt. 
This assembly of the human genome sequence is 
but a first, hesitant step on a long and 
exciting journey toward understanding the role 
of the genome in human biology. It has been 
possible only because of innovations in 
instrumentation and software that have allowed 
automation of almost every step of the process 
from DNA preparation to annotation. The next 
steps are clear: We must define the complexity 
that ensues when this relatively modest set of 
about 30,000 genes is expressed. The sequence 
provides the framework upon which all the 
genetics, biochemistry, physiology, and 
ultimately phenotype depend. It provides the 
boundaries for scientific inquiry. The sequence 
is only the first level of understanding of the 
genome. All genes and their control elements 
must be identified; their functions, in concert 
as well as in isolation, defined; their sequence 
variation worldwide described; and the relation 
between genome variation and specific 
phenotypic characteristics determined. Now we 
know what we have to explain. 

Another paramount challenge awaits: 
public discussion of this information and its 
potential for improvement of personal health. 
Many diverse sources of data have shown 
that any two individuals are more than 99.9% 
identical in sequence, which means that all 
the glorious differences among individuals in 
our species that can be attributed to genes 
fall in a mere 0.1% of the sequence. There 
are two fallacies to be avoided: determinism, 
the idea that all characteristics of the person 
are "hard-wired" by the genome; and reduc- 
tionism, the view that with complete knowl- 
edge of the human genome sequence, it is 
only a matter of time before our understand- 
ing of gene functions and interactions will 
provide a complete causal description of hu- 
man variability. The real challenge of human 
biology, beyond the task of finding out how 
genes orchestrate the construction and main- 
tenance of the miraculous mechanism of our 
bodies, will lie ahead as we seek to explain 
how our minds have come to organize 
thoughts sufficiently well to investigate our 
own existence. 

mRNA database (NCBI). Initial searches were per- 
formed on repeat-masked sequence with BLAST 2.0 
(54) optimized for the Compaq Alpha compute- 
server and an effective database size of $3 \times 10^{9}$ for 
BLASTN searches and $1 \times 10^{9}$ for BLASTX searches. 
Additional processing of each query-subject pair 
was performed to improve the alignments. All pro- 
tein BLAST results having an expectation score of 
$<1 \times 10^{-4}$, human nucleotide BLAST results having 
an expectation score of $<1 \times 10^{-8}$ with >94% 
identity, and rodent nucleotide BLAST results having 
an expectation score of $<1 \times 10^{-8}$ with >80% 
identity were then examined on the basis of their 
high-scoring pair (HSP) coordinates on the scaffold 
to remove redundant hits, retaining hits that sup- 
ported possible alternative splicing. For BLASTX 
searches, analysis was performed separately for se- 
lected model organisms (yeast, mouse, human, C. 
elegans, and D. melanogaster) so as not to exclude 
HSPs from these organisms that support the same 
gene structure. Sequences producing BLAST hits 
judged to be informative, nonredundant, and suffi- 
ciently similar to the scaffold sequence were then 
realigned to the genomic sequence with Sim4 for 
ESTs, and with Lap for proteins. Because both of 
these algorithms take splicing into account, the 
resulting alignments usually give a better represen- 
tation of intron-exon boundaries than standard 
BLAST analyses and thus facilitate further annota- 
tion (both machine and human). In addition to the 



probability of observing a duplicated set of three 
genes in two different locations, where the three 
genes occur across a spread of five positions in both 
locations, is $36/N^{2}$; the expected number of such 
matched sets in the predicted protein set is approx- 
imately $(N)36/N^{2} = 36/N$, a value $\ll 1$. Therefore, 
any such duplications of three genes are unlikely to 
result from random rearrangements of the genome. If 
... ing candidate duplications only generates matched ... 

103. J. Zhang, T. L. Madden, Genome Res. 7, 649 (1997). 

bers, and at least one from a multicellular eu- 
karyote, the cluster was extended. For the extension 
step, a hidden Markov model (HMM) was trained for 
the cluster, using the SAM software package, ver- 
sion 2. The HMM was then scored against GenBank 
NR (excluding mutants but including fragments for 
this step), and all sequences scoring better than a 
specific (NLL-NULL) score were added to the cluster. 
... then retrained (with fixed model ... 
sequences in the cluster were aligned ... 
to produce a multiple sequence alignment ... 

one another. Next, the resulting BLAST reports are 
parsed, and a graph is created wherein each protein 
constitutes a node; any hit between two proteins 
with an expectation beneath a user-specified 
threshold constitutes an edge. Lek then uses this 
graph to compute a similarity between each protein 
pair ij in the context of the graph as a whole by 
simply dividing the number of BLAST hits shared in 
common between the two proteins by the total 
number of proteins hit by i and j. This simple metric 
has several interesting properties. First, because the 
similarity metric takes into account both the simi- 
larity and the differences between the two sequenc- 
es, at the level of BLAST hits, the metric respects the 
multidomain nature of protein space. Two multido- 
main proteins, for instance, each containing do- 
mains A and B, will have a greater pairwise similarity 
to each other than either one will have to a protein 
containing only A or B domains, so long as A-B- 
containing multidomain proteins are less frequent in 
the proteome than are single-domain proteins con- 
taining A or B domains. A second interesting prop- 
erty of this similarity metric is that it can be used to 
produce a similarity matrix for the proteome as a 
whole without having to first produce a multiple 
alignment for each protein family, an error-prone 
and very time-consuming process. Finally, the met- 
ric does not require that either sequence have sig- 
nificant homology to the other in order to have a 
defined similarity to each other, only that they 
share at least one significant BLAST hit in common. 
This is an especially interesting property of the 
metric, because it allows the rapid recovery of pro- 
tein families from the proteome for which no mul- 
tiple alignment is possible, thus providing a compu- 
tational basis for the extension of protein homology 
searches beyond those of current HMM- and profile- 
based search methods. Once the whole-proteome 
... between two sequences. Next, these single-linkage 
clusters are further ... 
each member of which shares a user-specified pair- 
wise ... 
"complete clusters," e.g., those subclusters for 
which every member has a similarity metric of 1 to 
every other member of the subcluster. We believe 
that the single-linkage and complete clusters are of 
special interest, in part, because they allow us to 
estimate and to compare sizes of core protein sets 
in a rigorous manner. The rationale for this is as 
follows: if one imagines for a moment a perfect 
clustering algorithm capable of perfectly partition- 
ing one or more perfectly annotated protein sets 
into protein families, it is reasonable to assume that 
the number of clusters will always be greater than, 
or equal to, the number of single-linkage clusters, 
because single-linkage clustering is a maximally ag- 
glomerative clustering method. Thus, if there exists 
a single protein in the predicted protein set contain- 
ing domains A and B, then it will be clustered by 
single linkage together with all single-domain pro- 
teins containing domains A or B. Likewise, for a 
predicted protein set containing a single multido- 
main protein, the number of real clusters must 
always be less than or equal to the number of 
complete clusters, because it is impossible to place 
a unique multidomain protein into a complete clus- 
ter. Thus, the single-linkage and complete clusters 
plus singletons should comprise a lower and upper 
bound of sizes of core protein sets, respectively, 
allowing us to compare the relative size and com- 
plexity of different organisms' predicted protein set. 
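
As a rough illustration of the similarity metric and the single-linkage step described in the note above, the following Python sketch treats the significant BLAST hits of each protein as a set. It is not the Lek implementation: the function names, the toy hit list, the choice to count every protein as hitting itself (as it would in an all-against-all search), and the rule that any shared hit links two proteins into one single-linkage cluster are all assumptions made for the example.

```python
from itertools import combinations

# Hypothetical significant hits (pairs already below the expectation threshold).
# AB1 and AB2 carry domains A and B; A1 carries only A; B1 carries only B.
HITS = [
    ("AB1", "AB2"), ("AB1", "A1"), ("AB2", "A1"),
    ("AB1", "B1"), ("AB2", "B1"), ("C1", "C2"),
]


def hit_sets(hits):
    """Map each protein to the set of proteins it hits (self included, by assumption)."""
    neighbors = {}
    for a, b in hits:
        neighbors.setdefault(a, {a}).add(b)
        neighbors.setdefault(b, {b}).add(a)
    return neighbors


def similarity(neighbors, i, j):
    """Hits shared by i and j divided by the total number of proteins hit by i and j."""
    shared = neighbors[i] & neighbors[j]
    total = neighbors[i] | neighbors[j]
    return len(shared) / len(total)


def single_linkage(neighbors):
    """Group proteins into clusters; any shared hit links a pair (assumed rule)."""
    parent = {p: p for p in neighbors}

    def find(p):
        # Follow parent pointers to the cluster representative, compressing the path.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for i, j in combinations(neighbors, 2):
        if neighbors[i] & neighbors[j]:
            parent[find(i)] = find(j)

    clusters = {}
    for p in neighbors:
        clusters.setdefault(find(p), set()).add(p)
    return sorted(sorted(c) for c in clusters.values())


NEIGHBORS = hit_sets(HITS)
```

With this toy input, `similarity(NEIGHBORS, "AB1", "AB2")` is 1.0 and `similarity(NEIGHBORS, "AB1", "A1")` is 0.75, so the two A-B proteins are closer to each other than to a single-domain protein. A1 and B1 never hit each other directly, yet their similarity is 0.5 because they share hits, and `single_linkage(NEIGHBORS)` places them in the same cluster through the multidomain proteins.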

90. T. F. Smith, M. S. Waterman, J. Mol. Biol. 147, 195 
(1981). 

93. The probability that a contiguous set of proteins is 
the result of a segmental duplication can be esti- 
mated approximately as follows. Given that protein 
A and B occur on one chromosome, and that A' and 
B' (paralogs of A and B) also exist in the genome, 
the probability that B' occurs immediately after A' 
is $1/N$, where $N$ is the number of proteins in the set 
(for this analysis, $N$ = 26,588). Allowing for B' to 
occur as any of the next $J-1$ proteins (leaving a gap 
between A' and B') increases the probability to 
$(J-1)/N$; allowing B'A' or A'B' gives a probability of 
$2[(J-1)/N]$. Considering three genes ABC, the probabil- 
ity of observing A'B'C' elsewhere in the genome, 
given that the paralogs exist, is $1/N^{2}$. Three pro- 
teins can occur across a spread of five positions in 
six ways; more generally, we compute the number 
of ways that $K$ proteins can be spread across $J$ 
positions by counting all possible arrangements of 
$K-2$ proteins in the $J-2$ positions between the first 
and last protein. Allowing for a spread to vary from 
$K$ positions (no gaps) to $J$ gives 

$$L = \sum_{X=K}^{J} \binom{X-2}{K-2}$$

arrangements. Thus, the probability of chance occur- 
rence is $L/N^{K-1}$. Allowing for both sets of genes (e.g., 
ABC and A'B'C') to be spread across $J$ positions 
increases this to $L^{2}/N^{K-1}$. The duplicated segment 
might be rearranged by the operations of reversal or 
translocation; allowing for $M$ such rearrangements 
gives us a probability $P = L^{2}M/N^{K-1}$. For example, the 
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is a binomial sampling of the two homologs for each 
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